Abstract

A central problem in e-commerce is determining overlapping communities (clusters) among indi- 
viduals or objects in the absence of external identification or tagging. We address this problem by 
introducing a framework that captures the notion of communities or clusters determined by the relative 
affinities among their members. To this end we define what we call an affinity system, which is a set of 
elements, each with a vector characterizing its preference for all other elements in the set. We define a 
natural notion of (potentially overlapping) communities in an affinity system, in which the members of a 
given community collectively prefer each other to anyone else outside the community. Thus these com- 
munities are endogenously formed in the affinity system and are "self-determined" or "self-certified" by 
its members. 

We provide a tight polynomial bound on the number of self-determined communities as a function
of the robustness of the community. We present a polynomial-time algorithm for enumerating these
communities. Moreover, we obtain a local algorithm with a strong stochastic performance guarantee
that can find a community in time nearly linear in the size of the community (as opposed to the size of
the network).

Social networks and social interactions fit particularly naturally within the affinity system framework:
if we can appropriately extract the affinities from the relatively sparse yet rich information from social
networks and social interactions, our analysis then yields a set of efficient algorithms for enumerating
self-determined communities in social networks. In the context of social networks we also connect our
analysis with results about $(\alpha, \beta)$-clusters introduced by Mishra, Schreiber, Stanton, and Tarjan [16, 17].
In contrast with the polynomial bound we prove on the number of communities in the affinity system
model, we show that there exists a family of networks with a superpolynomial number of $(\alpha, \beta)$-clusters.



1 Introduction 



Affinity Systems The problem of identifying endogenously formed overlapping communities or clusters
arises in many contexts within e-commerce: finding overlapping communities in a social network, clustering
retail products using collaborative filtering, clustering documents using citation information, classifying
videos using viewing logs, etc. In such settings one needs to cluster the set of objects into meaningful,
potentially overlapping subsets by only using information about relations between the objects. In this paper
we develop the notion of an affinity system to model these scenarios.

An affinity system is a collection of elements with a set of "preferences" each of these elements has over 
other elements within the system. These preferences may be expressed as a vector of rankings, or, more 
generally, as a vector of non-negative weights representing affinities. For example, when clustering videos, 
affinities may represent the likelihood of the videos to be co-watched, with videos that are co-watched more 
often "ranking" each other higher. When clustering documents, a document will "prefer" documents it cites 
over documents it doesn't. 

Perhaps the most natural application of affinity systems is to the study of social networks. Social interaction 
is often determined by affinities among the members. For example, in daily life, we often stay more in touch 
with people we like more. When we go to a conference, we often hang out more with people with whom we 
share more interests. Therefore, these social interactions, and their manifestations as online social networks 
fit well within the affinity system paradigm. 

Endogenously Formed Communities in Affinity Systems A central question concerning groups of
individuals, documents, products, etc., is how to determine communities, or overlapping clusters that capture the
coherence among their members. For example, in the context of retail products discussed above, it may be
useful to automatically "tag" the products with multiple categories for subsequent personalized marketing.
In the context of professional networks, a person may belong to multiple explicit or implicit communities;
for example, a scientist may simultaneously belong to the community of Economists and the community of
Computer Scientists. The question of finding overlapping communities is closely related to the very well
studied question of clustering [10], but is much more general, since now elements may (and will) belong to
multiple communities.

In this paper we formalize a natural notion of self-determined community and develop efficient algorithms to 
identify overlapping communities of this type as well as general bounds on the number of such communities. 
Self-determined communities correspond to subsets that collectively prefer each other more than they prefer 
those outside the subset, where preference is defined by the rankings or weights of the affinity system. 
These communities are endogenously formed in the affinity system. What is particularly nice about this 
formulation is that we do not require that the subsets be of pre-specified sizes. For example, a solution of the 
flexible capacity roommate problem would group together people who prefer living with each other to living 
with anyone else in another room. Switching to the context of social and professional networks, an academic
community can be viewed as a group of scholars who prefer the work of others in the community
to the work of people outside their community. In all these cases, the overlapping communities or
clusters are self-certified or self-determined.

More formally, we study the mathematical structure of self-determined communities in an affinity system
and design efficient algorithms for discovering them. In our most basic model, we have $n$ members
$V = \{1, \ldots, n\}$ in an affinity system, and we assume each member $i$ states a strict ranking $\pi_i$ of all members in
the order of her preferences. To evaluate whether a subset $S$ of size $|S| = k$ is a good community, imagine
that each member $s \in S$ casts a vote for each of its $k$ most preferred members $\pi_s(1:k)$. The number of
votes that member $i$ receives, $\phi_S(i) = |\{s \in S : i \in \pi_s(1:k)\}|$, is the collective preference given by $S$.

We say S is self-determined if everyone in S receives more votes from S than everyone outside S. 
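
The voting rule above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the names `votes_received` and `is_self_determined` and the toy rankings are hypothetical, and the sketch assumes each ranking lists every member (including the member itself), most preferred first.

```python
from typing import Dict, List, Set


def votes_received(rankings: Dict[int, List[int]], S: Set[int]) -> Dict[int, int]:
    """Return phi_S(i): how many members of S rank i among their |S| top choices."""
    k = len(S)
    counts = {i: 0 for i in rankings}
    for s in S:
        for i in rankings[s][:k]:  # s votes for its k most preferred members
            counts[i] += 1
    return counts


def is_self_determined(rankings: Dict[int, List[int]], S: Set[int]) -> bool:
    """True if every member of S gets more votes from S than every non-member."""
    counts = votes_received(rankings, S)
    inside = [counts[i] for i in S]
    outside = [counts[j] for j in rankings if j not in S]
    if not inside:
        return False
    return not outside or min(inside) > max(outside)


# Toy system (hypothetical data): members 0 and 1 prefer each other, as do 2 and 3.
rankings = {
    0: [0, 1, 2, 3],
    1: [1, 0, 3, 2],
    2: [2, 3, 0, 1],
    3: [3, 2, 1, 0],
}
print(is_self_determined(rankings, {0, 1}))  # a pair whose members prefer each other
print(is_self_determined(rankings, {0, 2}))  # a pair whose members do not
```

In this toy system the pair {0, 1} passes the check, while {0, 2} fails because its members and non-members receive the same number of votes.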

Different self-determined communities may have different degrees of coherence or robustness depending on
both the fraction of votes received by the community members as well as the gap between the fraction of
votes received by the community members and the non-community members. To capture this, we say $S$ is a
$(\theta, \alpha, \beta)$ self-determined community, for $0 < \beta < \alpha \le 1$ and $\theta > 0$, if

• each member $s \in S$ casts a vote for each of its $\theta|S|$ most preferred members $\pi_s(1 : \theta|S|)$.

• for each $i \in S$, the amount of vote $i$ receives, $\phi^{\theta}_S(i) = |\{s \in S : i \in \pi_s(1 : \theta|S|)\}|$, is at least $\alpha|S|$.

• for each $j \notin S$, the amount of vote $j$ receives, $\phi^{\theta}_S(j) = |\{s \in S : j \in \pi_s(1 : \theta|S|)\}|$, is at most $\beta|S|$.

We start by analyzing how many communities can exist in an affinity system. Interestingly, we show that for
constants $\alpha, \beta, \theta$ we have a polynomial bound of $n^{O(\log(1/\gamma)/\alpha)}$, where $\gamma = \alpha - \beta$, on the number of
$(\theta, \alpha, \beta)$-self-determined communities. Our analysis, using probabilistic methods, also yields a polynomial-time
algorithm for enumerating these communities. Moreover, we show that our bound is nearly tight, by exhibiting an affinity
system with $n^{\Omega(1/\alpha)}$ $(\theta, \alpha, \beta)$-self-determined communities.

We then present a local community finding algorithm that is very efficient for an interesting range of
parameters. This algorithm, when given robustness parameters $\theta, \alpha, \beta$, and a member $v \in V$, either returns
a $(\theta, \alpha, \beta)$-self-determined community of size $t$ in time $O(f(\alpha, \beta, \theta) \cdot t \log t)$ or an empty set. The
algorithm satisfies the following performance guarantee: if $\alpha > 1/2$ and $v$ is chosen uniformly at random from
a $(\theta, \alpha, \beta)$-self-determined community $S$, then with probability $\Omega(2\alpha - 1)$ our local algorithm will
successfully recover $S$, and will do so in time that depends only (and nearly linearly) on $|S|$ and not on the size of the entire
affinity system. As a consequence of this analysis, we can show that in the (natural) case when $\alpha > 1/2$
we obtain a near-linear algorithm for finding all self-determined communities, substantially improving on
the polynomial-time guarantee discussed above. Quasi-linear local algorithms are particularly important
in the context of studying internet-scale networks, where even quadratic-time algorithms are not feasible,
and where one sometimes does not have access to the entire network but only to a local portion of it. The
quasi-linear algorithm is one of our main technical contributions, as its techniques can potentially be used
to convert other polynomial cluster-detection algorithms into local quasi-linear algorithms, at least in the
average case.

We also study multi-facet affinity systems where each member may have a number of different rankings
of other members. For example, member $i$ may have two rankings $\pi_{i,\mathrm{fun}}$ and $\pi_{i,\mathrm{science}}$, where the first ranks
members by how much fun $i$ thinks they are and the second ranks them according to academic affinity. In
this context, we say $S$ is a self-determined community if there exists a vector of choices of rankings (in this
case, in $\{\mathrm{fun}, \mathrm{science}\}^{|S|}$) such that if members vote according to their associated choice, the resulting
votes self-certify $S$. We prove that if each member has a constant number of rankings, all our results can be
extended, even though there could be an exponential number of combinations of rankings.

Our results can be extended to weighted affinity systems where the affinities of each member are given
by a numerical weighting rather than just an ordinal ranking. For example, member $i$ may give her most
preferred member weight 1, next two preferred members weight 0.7, next one weight 0.5, and so on. A
weighted affinity system can be expressed as $A = \{V, a_1, \ldots, a_n\}$, where $a_i$ is an $n$-dimensional vector
$a_i = (a_{i,1}, \ldots, a_{i,n})$ and $0 \le a_{i,j} \le 1$ specifies the degree of affinity that $i$ has for $j$. One can naturally
define $(\theta, \alpha, \beta)$-self-determined communities for weighted affinity systems. The only requirement is that
members are only allowed to cast votes up to a total weight of $\theta t$ when voting for a community of size $t$,
while respecting the affinity system. We show that all our bounds and algorithmic results extend to weighted
affinity systems with only a slight loss in the parameters.

Endogenously Formed Communities in Social Networks Our general formulation enables us to shed
light on the challenging task of defining and finding overlapping communities in social networks lfl9l [T71 
021 Q31 [15]] • Typically, a social network can be viewed as graph G = (V,E), where the edges could be 
either undirected (e.g., the Facebook social network determined by friend list) or directed (e.g. the Twitter 
network). An edge could be unweighted or weighted (e.g., the Skype phone-call networks or the Facebook 
network based on the number of times that one person writes on the wall of others). 

It turns out that a social network can be realized as a projection of an affinity system. Indeed, although our
affinity systems are typically dense, their projections as social networks can be very sparse, as are many
observed social networks. We can think of our observed social network interactions as being induced (in
various ways) by the underlying latent set of affinities. To be precise, given a social network $G = (V, E, w)$
with weights $w = (w_{i,j})$, we would like to recover the communities in the original affinity system. A natural
way to do this is to lift the social network back to an affinity system $A = \{V, a_1, \ldots, a_n\}$ and then to solve the
problem in the lifted system. For example, several natural approaches for lifting based on different beliefs
about how the social network may have emerged from an underlying set of affinities include:

1. Direct Lifting: One can directly lift to an affinity system by defining $a_{i,j} = w_{i,j}$ if $(i, j) \in E$,
otherwise $a_{i,j} = 0$ (we assume WLOG that $w_{i,j} \in [0, 1]$).

2. Shortest Path Lifting: If $G = (V, E)$ is an unweighted social network, and the shortest path distance
from $i$ to $j$ is $d_{i,j}$, one may define $a_{i,j} = 1/d_{i,j}$. The shortest path lifting can be extended to weighted
cases by appropriate normalization.

3. Personal PageRank Lifting: Let $p_i$ be the personal PageRank vector [2] of vertex $i$; we define
$a_{i,j} = p_{i,j}/\max(p_i)$.

4. Effective Resistance Lifting: Let $r_{i,j}$ be the effective resistance from $i$ to $j$ when viewing $G$ as a network of
resistors, using $1/w(e)$ as the resistance of $e \in E$ [9]; we define $a_{i,j} = \min_v(r_{i,v})/r_{i,j}$.
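
The first two liftings are simple enough to sketch directly. The following Python fragment is an illustration under stated assumptions rather than the paper's code: the function names are hypothetical, edges are treated as directed, and the sketch sets $a_{i,i} = 0$ and gives unreachable pairs affinity 0, two conventions the text does not specify.

```python
from collections import deque
from typing import Dict, List, Tuple

Affinity = Dict[int, Dict[int, float]]


def direct_lifting(nodes: List[int], weights: Dict[Tuple[int, int], float]) -> Affinity:
    """a[i][j] = w_ij if (i, j) is an edge, else 0 (weights assumed to lie in [0, 1])."""
    return {i: {j: weights.get((i, j), 0.0) for j in nodes} for i in nodes}


def shortest_path_lifting(nodes: List[int], edges: List[Tuple[int, int]]) -> Affinity:
    """a[i][j] = 1 / d_ij, with d_ij the BFS distance from i to j in an unweighted graph."""
    adj: Dict[int, List[int]] = {i: [] for i in nodes}
    for i, j in edges:
        adj[i].append(j)
    affinity: Affinity = {}
    for src in nodes:
        dist = {src: 0}
        queue = deque([src])
        while queue:  # plain breadth-first search from src
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        # Assumption: self-affinity and unreachable targets both get affinity 0.
        affinity[src] = {
            j: (1.0 / dist[j] if j in dist and j != src else 0.0) for j in nodes
        }
    return affinity


nodes = [0, 1, 2]
sp = shortest_path_lifting(nodes, [(0, 1), (1, 2)])   # directed path 0 -> 1 -> 2
direct = direct_lifting(nodes, {(0, 1): 0.5})
print(sp[0], direct[0])
```

Either result can then be converted to rankings (sort each row by decreasing affinity) or used directly as a weighted affinity system.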

Each style of lifting corresponds to a particular belief on how this social network may have emerged from a
latent underlying affinity system. For instance, Direct Lifting corresponds to the belief that a social network
$G = (V, E)$, such as the Twitter network, arose from a latent affinity system $A' = (V, \{a'_1, \ldots, a'_n\})$ by a process
in which each member $i$ connects to the $d_i$ topmost elements according to the affinity system $A'$. In other
words, $i$ follows $j$, i.e., $(i, j)$ is a directed edge in this social network, if $j$ is among the $d_i$ topmost elements
of $i$ according to $A'$. Similarly, one can think of Shortest Path Lifting as corresponding to the belief that
the social network serves as an approximate spanner of the underlying affinity system [18], and Effective
Resistance Lifting corresponds to the belief that a social network is approximately based on some spectral
sparsification of those underlying affinities [20].

Given a social network, once we derive a corresponding affinity system A, we may use our notion of self- 
determined community and apply our algorithms and analysis to obtain communities in the original network. 
From our analysis for affinity systems, we immediately obtain that there is a polynomial number of such 
communities in a social network, and they can be enumerated in polynomial time. 

We note that while the input social network is potentially very sparse, appropriate lifting procedures can
produce an affinity system better reflecting the true relationships between entities. Moreover, many lifting
procedures can be performed locally, allowing our local algorithm to determine meaningful communities especially
efficiently.

We also note that our study of multi-facet affinity systems allows us to model and analyze communities in
more complex social networks, such as Google+ with circles, which enable its users to share different
things with different circles of people. This extension may also enable us to model interdisciplinary
subfields according to scientific works or interactions.

Self-determined Communities and $(\alpha, \beta)$-clusters In this paper, we also provide several new results for
communities defined as $(\alpha, \beta)$-clusters, a notion introduced by Mishra, Schreiber, Stanton, and Tarjan [16]
for analyzing (unweighted) social networks. In their definition, $S$ is an $(\alpha, \beta)$-cluster (for $\alpha > \beta$) if for every
$i \in S$, the number of neighbors coming from $S$ is at least $\alpha|S|$ and for every $j \notin S$, the number of neighbors
coming from $S$ is at most $\beta|S|$. We prove that there exists a family of networks with a superpolynomial
number of $(\alpha, \beta)$-clusters. For instance, if $\alpha = 1$ and $\alpha - \beta = 0.01$, then in $G(n, 1/2)$, the Erdős-Rényi
random graph with 1/2 edge probability, the expected number of $(\alpha, \beta)$-clusters is $n^{\Omega(\log n)}$. We also show
that under the assumption that the planted clique problem is hard, even finding a single $(\alpha, \beta)$-cluster is
computationally hard. Interestingly, our notion of communities in social networks obtained via direct lifting
is quite similar to the notion of $(\alpha, \beta)$-clusters, with the only difference that we bound the total amount of
votes a member may cast.$^{2}$ This twist seems to be essential to obtain only a polynomial number of such
communities and to be able to enumerate them in polynomial time.
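
For comparison with the voting-based definition, the $(\alpha, \beta)$-cluster condition can be checked directly on an undirected graph. The sketch below is illustrative only: `is_alpha_beta_cluster` is a hypothetical name, and the `count_self` switch (treating each vertex as adjacent to itself) is an assumption added here so that a clique can qualify with $\alpha = 1$; the text above does not say how a vertex's own membership is counted.

```python
from typing import Dict, Set


def is_alpha_beta_cluster(
    adj: Dict[int, Set[int]],
    S: Set[int],
    alpha: float,
    beta: float,
    count_self: bool = True,
) -> bool:
    """Check: members have >= alpha|S| neighbors in S, non-members have <= beta|S|."""
    size = len(S)
    for v in adj:
        inside = len(adj[v] & S)
        if count_self and v in S and v not in adj[v]:
            inside += 1  # assumption: a member counts as its own neighbor
        if v in S and inside < alpha * size:
            return False
        if v not in S and inside > beta * size:
            return False
    return True


# Hypothetical graph: a triangle {0, 1, 2} with a pendant vertex 3 attached to 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(is_alpha_beta_cluster(adj, {0, 1, 2}, alpha=1.0, beta=0.5))
```

Note that nothing here limits how many neighbors a vertex may have, which is exactly the difference from the vote-capped definition discussed above.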

Related Work Problems of clustering or grouping data (based on network or pairwise similarity information 
or ranked data) have been extensively studied in many different fields. The classic goals have been to produce 
either a partition or a hierarchal clustering of the data ifTOl |6l @] [Sj 13 [13 . With the rise of online social 
networks, there has been significant recent interest in identifying overlapping clusters, or communities, in 
networks ranging from professional contact networks to citation networks to product-purchasing networks, 
with many heuristics and optimization criteria being proposed [14, 15, 19, 16, 17, 12]. However, much of this
work has disallowed natural communities such as those containing highly popular nodes [21, 22, 14, 15] or
not given general guarantees on the number or computation time needed to find all overlapping communities
meeting natural criteria [7, 16, 17, 12]. By contrast, our new formalization leads to natural communities and
efficient algorithms for identifying all such communities. Additionally, our model allows us to deal with 
asymmetries in the input in a very natural way. 

Independently, recent work [3] considers several assumptions (that are between worst case and average
case) concerning community structure and provides efficient algorithms in these settings. Remarkably, while
their setting is somewhat different from ours, some of their algorithms are similar in spirit.

2 Preliminaries and Notation 

In our most basic model, we consider an affinity system with $n$ members $V = \{1, \ldots, n\}$ and assume
that each member $i \in V$ states a strict ranking $\pi_i$ of all members in the order of her preferences. Let
$\Pi = \{\pi_1, \ldots, \pi_n\}$. For $t > 0$, $S \subseteq V$, $i \in V$ we denote by $v^t_S(i)$ the number of members in $S$ that place $i$
among the topmost $t$ elements of their preference list. That is, $v^t_S(i) = |\{s \in S \mid i \in \pi_s(1 : t)\}|$. For $\theta > 0$,
we let $\phi^{\theta}_S(i) := v^{\lceil \theta|S| \rceil}_S(i)$. We define a natural notion of self-determined community as follows:



2 For example, for the direct lifting of $G$, the notion of the community we obtain is as follows: $S$ is a $(\theta, \alpha, \beta)$-self-determined
community in $G$ if every $i \in S$ receives at least $\alpha|S|$ collective vote and everyone not in $S$ receives at most $\beta|S|$ collective vote.
If $G$ is unweighted, $(i, j) \in E$, $i \in S$, and $d_i$ is the out-degree of $i$, then one way to set up the affinity system is to let $i$ contribute
$\min(1, \theta|S|/d_i)$ to the collective vote of $j$.

Definition 1 Given three positive parameters $\theta, \alpha, \beta$, where $\beta < \alpha \le 1$, and an affinity system $(V, \Pi)$ we
say that a subset $S$ of $V$ is a $(\theta, \alpha, \beta)$ self-determined community with respect to $(V, \Pi)$ if we have both

(1) For all $i \in S$, $\phi^{\theta}_S(i) \ge \alpha|S|$.

(2) For all $j \notin S$, $\phi^{\theta}_S(j) \le \beta|S|$.

Throughout the paper, we will write $\gamma = \alpha - \beta$. Fixing $\theta$, we say that "$i$ votes for $j$ with respect to a
subset $S$" if $j \in \pi_i(1 : \lceil \theta|S| \rceil)$. When $S$ is clear from the context we say that $i$ votes for $j$.

Note that communities may overlap. As a simple example, assume we have two sets $A_1$ and $A_2$ of size $n/2$
with $n/8$ nodes in common (representing, say, researchers in Algorithms and researchers in Complexity).
Assume each node in $A_i \setminus A_j$ ranks first the nodes in $A_i$ and then the nodes in $A_j$ and that each node in
$A_i \cap A_j$ ranks the nodes in $A_i \cup A_j$ arbitrarily. Then each $A_i$ is a $(1, 3/4, 1/4)$ self-determined community.
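
The definition and the overlapping example can be checked mechanically. The sketch below is a small illustration with hypothetical names (`collective_votes`, `is_community`, `ranking_with_prefix`); it instantiates the example with $n = 16$, two sets of size 8 sharing 2 nodes, and picks one arbitrary ordering (reverse numeric order) for the shared nodes.

```python
import math
from typing import Dict, List, Set


def collective_votes(rankings: Dict[int, List[int]], S: Set[int], theta: float) -> Dict[int, int]:
    """phi^theta_S(i): number of s in S with i among s's ceil(theta*|S|) top choices."""
    top = math.ceil(theta * len(S))
    counts = {i: 0 for i in rankings}
    for s in S:
        for i in rankings[s][:top]:
            counts[i] += 1
    return counts


def is_community(rankings, S: Set[int], theta: float, alpha: float, beta: float) -> bool:
    """Conditions (1) and (2) of the definition of a (theta, alpha, beta) community."""
    counts = collective_votes(rankings, S, theta)
    size = len(S)
    return all(counts[i] >= alpha * size for i in S) and all(
        counts[j] <= beta * size for j in rankings if j not in S
    )


n = 16
everyone = list(range(n))
A1, A2 = set(range(0, 8)), set(range(6, 14))  # |A1| = |A2| = n/2, overlap of n/8 = 2


def ranking_with_prefix(*groups: List[int]) -> List[int]:
    """Rank the given groups first (in order), then everyone else."""
    order: List[int] = []
    for group in list(groups) + [everyone]:
        order += [v for v in group if v not in order]
    return order


rankings = {}
for v in everyone:
    if v in A1 and v in A2:   # shared nodes: A1 ∪ A2 in an arbitrary order
        rankings[v] = ranking_with_prefix(sorted(A1 | A2, reverse=True))
    elif v in A1:             # own set first, then the other set
        rankings[v] = ranking_with_prefix(sorted(A1), sorted(A2))
    elif v in A2:
        rankings[v] = ranking_with_prefix(sorted(A2), sorted(A1))
    else:
        rankings[v] = ranking_with_prefix()

print(is_community(rankings, A1, 1.0, 0.75, 0.25), is_community(rankings, A2, 1.0, 0.75, 0.25))
```

Both sets satisfy the definition with $(\theta, \alpha, \beta) = (1, 3/4, 1/4)$ even though they share two members, which is the overlap the example describes.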

We also consider (more general) weighted affinity systems, where the preferences of each member $i$ involve
numerical weightings (degrees of affinity) rather than just an ordinal ranking. A weighted affinity system is
expressed as $A = \{V, a_1, \ldots, a_n\}$, where $a_i$ is an $n$-dimensional vector $a_i = (a_{i,1}, \ldots, a_{i,n})$ and $0 \le a_{i,j} \le 1$
specifies the degree of affinity that $i$ has for $j$. For example, $i$ may give her top-ranked node a weight of 1,
she might have a tie between her second and third-ranked nodes, giving both a weight of 0.7, and so on. If
member $i$ chooses not to vote for a given node, this can be modeled by giving that node a weight of 0.

We can naturally extend our notion of $(\theta, \alpha, \beta)$-self-determined communities to weighted affinity systems.
In the definition of voting, the only requirement is that members are only allowed to cast votes up to a total
weight of $\theta t$ when voting for a community of size $t$. For example, to evaluate whether a subset $S$ is a good
community, each member $s \in S$ casts a weighted vote as follows: $s$ determines a prefix of the weights
(sorted from highest to lowest) of total value $\theta|S|$ and zeros out the rest. If there are ties at the boundary, a
natural conversion is to scale down the weights of those nodes just at the boundary to make the sum exactly
equal to $\theta|S|$. In general, we denote the resulting vector (after capping the amount of vote a member casts
when voting for a community of size $t$) as $a_s^{\theta t}$. The amount of weight that member $i \in V$ receives from
$S$ is $\phi^{\theta}_S(i) = \sum_{s \in S} a^{\theta|S|}_{s,i}$. Given these, we can define a $(\theta, \alpha, \beta)$ weighted self-determined community as
follows:

Definition 2 Given $\theta, \alpha, \beta > 0$, $\beta < \alpha \le 1$, and a weighted affinity system $(V, A)$ we say that a subset $S$
of $V$ is a $(\theta, \alpha, \beta)$ weighted self-determined community with respect to $(V, A)$ if we have both

(1) For all $i \in S$, $\phi^{\theta}_S(i) \ge \alpha|S|$.

(2) For all $j \notin S$, $\phi^{\theta}_S(j) \le \beta|S|$.
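
One way to read the capping rule is sketched below. This is an interpretation, not the paper's code: `cap_votes` and `is_weighted_community` are hypothetical names, the boundary group (all nodes tied at the weight where the budget runs out, including a group of one) is scaled down so the total is exactly the budget, and a member's weight for itself is treated like any other entry.

```python
from typing import Dict, Set


def cap_votes(weights: Dict[int, float], budget: float) -> Dict[int, float]:
    """Keep the heaviest weights up to total `budget`; scale the boundary tie group."""
    capped = {j: 0.0 for j in weights}
    remaining = budget
    for w in sorted(set(weights.values()), reverse=True):
        if w <= 0.0:
            break
        group = [j for j, wj in weights.items() if wj == w]  # nodes tied at weight w
        total = w * len(group)
        if total <= remaining:
            for j in group:
                capped[j] = w
            remaining -= total
        else:
            scale = remaining / total  # shrink the boundary group to fit the budget
            for j in group:
                capped[j] = w * scale
            break
    return capped


def is_weighted_community(
    affinity: Dict[int, Dict[int, float]], S: Set[int],
    theta: float, alpha: float, beta: float,
) -> bool:
    """Check the two conditions of the weighted definition with budget theta*|S|."""
    budget = theta * len(S)
    received = {i: 0.0 for i in affinity}
    for s in S:
        for i, w in cap_votes(affinity[s], budget).items():
            received[i] += w
    size = len(S)
    return all(received[i] >= alpha * size for i in S) and all(
        received[j] <= beta * size for j in affinity if j not in S
    )


# Hypothetical weights: a budget of 1.5 keeps the weight-1 node and halves the tied pair.
print(cap_votes({1: 1.0, 2: 0.5, 3: 0.5, 4: 0.25}, 1.5))

affinity = {
    0: {0: 1.0, 1: 1.0, 2: 0.5},
    1: {0: 1.0, 1: 1.0, 2: 0.5},
    2: {0: 0.5, 1: 0.5, 2: 1.0},
}
print(is_weighted_community(affinity, {0, 1}, theta=1.0, alpha=1.0, beta=0.0))
```

The second call shows the role of $\theta$: with a budget of $\theta|S| = 2$ the members of $\{0, 1\}$ spend all their weight on each other, while a larger budget would let votes leak to node 2.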

We note that given a (weighted) affinity system and a set $S$ we can test in time polynomial in $n$ whether
the proposed set $S$ is a $(\theta, \alpha, \beta)$-self-determined community or not. Also, fixing a $(\theta, \alpha, \beta)$-self-determined
community $S$, one can easily show that there exists a multiset $U$ of size $k(\gamma) = 2 \log(4n)/\gamma^2$ such that the
set of elements $i$ voted for by at least an $(\alpha - \gamma/2)$ fraction of $U$ (or in the weighted case, the set of elements $i$
receiving at least $(\alpha - \gamma/2)|U|$ total vote from $U$) is identical to $S$. This then implies a very simple quasi-polynomial
procedure for finding all self-determined communities, as well as an $n^{O(\log n/\gamma^2)}$ upper bound on the number
of $(\theta, \alpha, \beta)$-self-determined communities. (See Appendix A.1 for details.)






In this paper we present a multi-stage approach for finding an unknown community in an affinity system that
provides much better guarantees for interesting settings of the parameters. At a generic level, this algorithm
takes as input information $I$ about an unknown community $S$ and outputs a list $\mathcal{L}$ of subsets of $V$ s.t. if
information $I$ is correct with respect to $S$, then with high probability $\mathcal{L}$ contains $S$. This algorithm has
two main steps: it first generates a list $\mathcal{L}_1$ of sets $S_1$ s.t. at least one of the elements in $\mathcal{L}_1$ is a rough
approximation to $S$ in the sense that $S_1$ nearly contains $S$ and it is not much larger than $S$. In the second
step, it runs a purification procedure to generate a list $\mathcal{L}$ that contains $S$. (See Algorithm 1.) Both steps
have to be done with care by exploiting properties of self-determined communities, and we will describe
in detail in the following sections ways to implement both steps of this generic scheme. We also discuss
how to adapt this scheme for outputting a self-determined community in a local manner, for enumerating all
self-determined communities, as well as extensions to multi-facet affinity systems and applications of our
analysis to social networks.

Algorithm 1 A generic algorithm for identifying an unknown community $S$
Input: Preference system $(V, \Pi)$, information $I$ about an unknown community $S$.

(1) Use information $I$ to generate a list $\mathcal{L}_1$ of sets $S_1$ s.t. at least one of the elements in $\mathcal{L}_1$ is a rough
approximation to $S$.

(2) Run a purification procedure to generate a list $\mathcal{L}$ s.t. at least one of the elements in $\mathcal{L}$ is identical to $S$.

(3) Remove from the list $\mathcal{L}$ all the sets that are not self-determined communities.
Output: List of self-determined communities $\mathcal{L}$.



3 Finding Self-determined Communities 

In this section we show how to instantiate the generic Algorithm 1 if the information we are given about the
unknown community $S$ is its size and the parameters $\theta$, $\alpha$, and $\beta$. We show that this leads to a polynomial
time algorithm in the case where $\theta$, $\alpha$, and $\beta$ are constant. We start with a structural result showing that for
any self-determined community $S$ there exist a small number of community members s.t. the union of their
votes contains almost all of $S$.

Lemma 1 Let $S$ be a $(\theta, \alpha, \beta)$-self-determined community. Let $\gamma = \alpha - \beta$, $M = \log(16/\gamma)/\alpha$. There
exists a set $U$, $|U| \le M$, s.t. the set $S_1 = \{i \in V \mid \exists s \in U, i \in \pi_s(1 : \theta|S|)\}$ satisfies $|S \setminus S_1| \le (\gamma/16)|S|$.

Proof: Note that any subset $\tilde{S}$ of $S$ receives a total of at least $\alpha|S||\tilde{S}|$ votes from elements of $S$, which
implies that for any such $\tilde{S}$ there exists $i_{\tilde{S}} \in S$ that votes for at least $\alpha|\tilde{S}|$ members of $\tilde{S}$. Given this, we find
the desired elements $i_1, \ldots, i_M \in S$ greedily one by one. Formally, let $\tilde{S}_1 = S$. Let $i_1 \in S$ be an element
that votes for at least $\alpha|\tilde{S}_1|$ elements in $\tilde{S}_1$. Let $\tilde{S}_2$ be the set $\tilde{S}_1$ minus the set of elements voted for by $i_1$. In
general, at step $l \ge 2$, there exists $i_l \in S$ that votes for at least an $\alpha$ fraction of $\tilde{S}_l$. Let $\tilde{S}_{l+1}$ be the set $\tilde{S}_l$
minus the set of elements voted for by $i_l$. We clearly have $|\tilde{S}_{l+1}| \le (1 - \alpha)^l|\tilde{S}_1|$, so $|\tilde{S}_{M+1}| \le (\gamma/16)|\tilde{S}_1|$ for
$M = \log(16/\gamma)/\alpha$. By construction the set $U = \{i_1, \ldots, i_M\} \subseteq S$ satisfies the desired condition. ■
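
The greedy argument in the proof can be traced on a small instance. The sketch below is only an illustration of the existence argument: it needs $S$ to be known, the name `greedy_cover` and the toy rankings are hypothetical, and ties are broken by smallest member id.

```python
import math
from typing import Dict, List, Set, Tuple


def greedy_cover(
    rankings: Dict[int, List[int]], S: Set[int], theta: float, gamma: float
) -> Tuple[List[int], Set[int]]:
    """Pick members of S greedily until at most (gamma/16)|S| of S is left uncovered."""
    top = math.ceil(theta * len(S))
    voted = {s: set(rankings[s][:top]) for s in S}  # whom each member votes for
    uncovered = set(S)
    chosen: List[int] = []
    while len(uncovered) > (gamma / 16) * len(S):
        # The member covering the most still-uncovered elements (ties: smallest id).
        best = max(sorted(S), key=lambda s: len(voted[s] & uncovered))
        if not voted[best] & uncovered:
            break  # no progress possible; cannot happen for a true community
        chosen.append(best)
        uncovered -= voted[best]
    return chosen, uncovered


# Hypothetical system: S = {0, 1, 2, 3} is a (0.5, 0.5, 0) community (each member of S
# gets 2 of 4 votes, outsiders get none), so gamma = 0.5.
rankings = {
    0: [0, 1, 2, 3, 4, 5],
    1: [1, 2, 3, 0, 4, 5],
    2: [2, 3, 0, 1, 4, 5],
    3: [3, 0, 1, 2, 4, 5],
    4: [4, 5, 0, 1, 2, 3],
    5: [5, 4, 0, 1, 2, 3],
}
U, left = greedy_cover(rankings, {0, 1, 2, 3}, theta=0.5, gamma=0.5)
print(U, left, math.log(16 / 0.5) / 0.5)  # chosen voters, uncovered set, the bound M
```

Here two voters already cover all of $S$, well within the bound $M = \log(16/\gamma)/\alpha$ (computed with the natural logarithm, an assumption about the intended base).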

Given Lemma 1 we can use the following procedure for generating a list that contains a rough
approximation to $S$ which covers at least a $1 - \gamma/16$ fraction of $S$ and whose size is at most $M\theta|S| = \theta\log(16/\gamma)|S|/\alpha$.

Algorithm 2 Generate rough approximations

Input: Preference system $(V, \Pi)$, information $I$ (parameters $\theta, \alpha, \beta$, size $t$).

• Set $\mathcal{L} = \emptyset$, $\gamma = \alpha - \beta$, $k_1(\theta, \alpha, \gamma) = \log(16/\gamma)/\alpha$.

• Exhaustively search over all subsets $U$ of $V$ of size $k_1(\theta, \alpha, \gamma)$; for each set $U$ add to the list $\mathcal{L}$ the set
$S_1 \subseteq V$ of points voted for by at least one element in $U$ (i.e., $S_1 = \{i \in V \mid \exists s \in U, i \in \pi_s(1 : \theta t)\}$).

Output: List of sets $\mathcal{L}$.
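
A direct, brute-force rendering of Algorithm 2 might look as follows. It is a sketch under assumptions: `rough_approximations` is a hypothetical name, $k_1$ is rounded up and capped at $n$, the logarithm is taken to be natural, and duplicate candidate sets are dropped (a harmless optimization not stated in the algorithm).

```python
import itertools
import math
from typing import Dict, FrozenSet, List


def rough_approximations(
    rankings: Dict[int, List[int]], theta: float, alpha: float, beta: float, t: int
) -> List[FrozenSet[int]]:
    """Enumerate all size-k1 subsets U of V and collect the union of their votes."""
    gamma = alpha - beta
    k1 = min(len(rankings), math.ceil(math.log(16 / gamma) / alpha))
    top = math.ceil(theta * t)  # each voter votes for its theta*t top choices
    candidates: List[FrozenSet[int]] = []
    seen = set()
    for U in itertools.combinations(sorted(rankings), k1):
        S1 = frozenset(i for s in U for i in rankings[s][:top])
        if S1 not in seen:  # skip duplicates of an already recorded candidate
            seen.add(S1)
            candidates.append(S1)
    return candidates


# Hypothetical system with three mutually preferring pairs: {0,1}, {2,3}, {4,5}.
rankings = {
    0: [0, 1, 2, 3, 4, 5],
    1: [1, 0, 2, 3, 4, 5],
    2: [2, 3, 0, 1, 4, 5],
    3: [3, 2, 0, 1, 4, 5],
    4: [4, 5, 0, 1, 2, 3],
    5: [5, 4, 0, 1, 2, 3],
}
candidates = rough_approximations(rankings, theta=1.0, alpha=1.0, beta=0.0, t=2)
print(candidates)
```

With $\alpha = 1$ and $\gamma = 1$ the subsets have size $k_1 = 3$, so every candidate is a union of two or three pairs; each pair is contained in some candidate, but none is recovered exactly, which is why a purification step is needed.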



We now describe a lemma that will be useful for analyzing the purification step, suggesting how we convert 
a rough approximation to S into a list of candidate much-closer approximations to S. 

Lemma 2 Fix a $(\theta, \alpha, \beta)$-self-determined community $S$. Let $\gamma = \alpha - \beta$, $t = |S|$, and $S_1 \subseteq V$, $|S_1| = M\theta t$
s.t. $|S \setminus S_1| \le \gamma t/16$. Let $U$ be a set of $k$ points drawn uniformly at random from $\tilde{S} = S \cap S_1$. Let
$S_2$ be the subset of points in $S_1$ that are voted for by at least an $\alpha - \gamma/2$ fraction of nodes in $U$, i.e.,
$S_2 = \{i \in S_1 \mid v^{\theta t}_U(i) \ge (\alpha - \gamma/2)|U|\}$. If $k = 8\log(32\theta M/(\delta\gamma))/\gamma^2$, then with probability $\ge 1 - \delta$, we have the
symmetric difference $|\Delta(S_2, S)| \le \gamma t/8$.

Proof: We start by showing that the points in $S$ are voted for by at least a $\gamma/2$ larger fraction of $\tilde{S}$ than the
points in $S_1 \setminus S$. Let $i \in S$. Since $S$ is $(\theta, \alpha, \beta)$-self-determined, at least $\alpha t$ points in $S$ vote for $i$, and since
$|S \setminus \tilde{S}| \le \gamma t/16$ we get that at least $(\alpha - \gamma/16)t$ points in $\tilde{S}$ vote for $i$. Since $|\tilde{S}| \le t$, we obtain that at least
an $\alpha - \gamma/16$ fraction of points in $\tilde{S}$ vote for $i$. Let $j$ be a point in $S_1 \setminus S$. We know that at most $\beta t$ points in $S$
vote for $j$, and since $|\tilde{S}| \ge (1 - \gamma/16)t$, we have that at most an $\alpha - 3\gamma/4$ fraction of points in $\tilde{S}$ vote for $j$.

Fix $i \in S_1$. By Hoeffding's inequality, since $U$ is a set of $8\log(32\theta M/(\delta\gamma))/\gamma^2$ points drawn uniformly at
random from $\tilde{S}$, we have that with probability at least $1 - \gamma\delta/(16\theta M)$ the fraction of points in $\tilde{S}$ that vote
for $i$ is within $\gamma/4$ of the fraction of points in $U$ that vote for $i$. These together with the above observations
imply that the expected size of $|\Delta(S_2, \tilde{S})|$ is at most $(\gamma\delta/(16\theta M))\theta M t = \gamma\delta t/16$. By Markov's inequality we
obtain that there is at most a $\delta$ chance that $|\Delta(S_2, \tilde{S})| \ge \gamma t/16$. Using the fact that $|S \setminus \tilde{S}| \le \gamma t/16$ we finally
get that with probability $1 - \delta$ we have $|\Delta(S_2, S)| \le \gamma t/8$. ■

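Algorithm 3 Purification procedure

Input: Preference system $(V, \Pi)$, information $I$ (parameters $\theta, \alpha, \beta, \gamma$, $k_2(\theta, \alpha, \gamma)$, $N_2(\theta, \alpha, \gamma)$, size $t$), list
of rough approximations $\mathcal{L}_1$.

• For each element $S_1 \in \mathcal{L}_1$, repeat $N_2(\theta, \alpha, \gamma)$ times:

    • Sample a set $U_2$ of $k_2(\theta, \alpha, \gamma)$ points at random from $S_1$. Let $S_2 = \{i \in S_1 \mid v^{\theta t}_{U_2}(i) \ge (\alpha - \gamma/2)|U_2|\}$.

    • Let $S_3 = \{i \in V \mid v^{\theta t}_{S_2}(i) \ge (\alpha - \gamma/2)|S_2|\}$. Add $S_3$ to the list $\mathcal{L}$.

Output: List of sets $\mathcal{L}$.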
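
The purification loop can be sketched as below. The names `votes_from` and `purify` are hypothetical, sampling is done with replacement through `random.Random.choices` (an assumption, since the algorithm only says "at random"), repeated voters are counted with multiplicity, and rounds that produce an empty $S_2$ are skipped.

```python
import math
import random
from typing import Dict, FrozenSet, Iterable, List


def votes_from(rankings: Dict[int, List[int]], voters: Iterable[int], top: int) -> Dict[int, int]:
    """v^top_voters(i): votes each element gets from the given voters (with multiplicity)."""
    counts = {i: 0 for i in rankings}
    for s in voters:
        for i in rankings[s][:top]:
            counts[i] += 1
    return counts


def purify(
    rankings: Dict[int, List[int]], rough_list: List[FrozenSet[int]],
    theta: float, alpha: float, beta: float, t: int,
    k2: int, n2: int, rng: random.Random,
) -> List[FrozenSet[int]]:
    """For each rough set S1, sample U2, threshold to S2 inside S1, then to S3 in V."""
    threshold = alpha - (alpha - beta) / 2  # alpha - gamma/2
    top = math.ceil(theta * t)
    out: List[FrozenSet[int]] = []
    for S1 in rough_list:
        pool = sorted(S1)
        for _ in range(n2):
            U2 = rng.choices(pool, k=k2)  # assumption: sampling with replacement
            v = votes_from(rankings, U2, top)
            S2 = [i for i in pool if v[i] >= threshold * len(U2)]
            if not S2:
                continue
            v2 = votes_from(rankings, S2, top)
            out.append(frozenset(i for i in rankings if v2[i] >= threshold * len(S2)))
    return out


# Hypothetical system with three mutually preferring pairs; purify a rough set
# that happens to equal the community {0, 1}.
rankings = {
    0: [0, 1, 2, 3, 4, 5], 1: [1, 0, 2, 3, 4, 5],
    2: [2, 3, 0, 1, 4, 5], 3: [3, 2, 0, 1, 4, 5],
    4: [4, 5, 0, 1, 2, 3], 5: [5, 4, 0, 1, 2, 3],
}
result = purify(rankings, [frozenset({0, 1})], 1.0, 1.0, 0.0, t=2, k2=3, n2=5,
                rng=random.Random(0))
print(set(result))
```

In a full run, the list of rough approximations would come from Algorithm 2, and the output would then be filtered by the community test of step (3) in Algorithm 1.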



We now show how Lemmas 1 and 2 can be used to identify and enumerate communities.

Theorem 1 Fix a $(\theta, \alpha, \beta)$-self-determined community $S$. Let $\gamma = \alpha - \beta$, $k_1(\theta, \alpha, \gamma) = \log(16/\gamma)/\alpha$,
$k_2(\theta, \alpha, \gamma) = \frac{8}{\gamma^2}\log\left(\frac{64\,\theta k_1}{\gamma\delta}\right)$, $N_2(\theta, \alpha, \gamma) = O((2\theta k_1)^{k_2}\log(1/\delta))$. Using Algorithm 2 together with
Algorithm 3 for steps (1) and (2) of Algorithm 1, we have that with probability $\ge 1 - \delta$ one of the elements in the
list $\mathcal{L}$ we output is identical to $S$.

Proof: Since when running Algorithm 2 we search over all subsets $U$ of $V$ of size $k_1(\theta, \alpha, \gamma)$, by
Lemma 1 in one of the rounds we find a set $U$ s.t. the set of points $S_1$ that are voted for by at least one element
in $U$ covers a $1 - \gamma/16$ fraction of $S$. So $\mathcal{L}_1$ contains a rough approximation to $S$.

Since $|S| = t$, $U_2$ is a set of $k_2$ elements drawn at random from $\tilde{S} = S \cap S_1$ with probability $\ge (t/(2t\theta k_1))^{k_2}$.
Therefore for $N_2 = O((2\theta k_1)^{k_2}\log(1/\delta))$, with probability $\ge 1 - \delta/2$ in one of the rounds the set $U_2$ is a
set of $k_2$ elements drawn at random from $\tilde{S}$. In such a round, by Lemma 2, with probability $\ge 1 - \delta/2$ we
get a set $S_2$ such that $|\Delta(S_2, S)| \le \gamma t/8$. A simple calculation shows that $S_3 = S$. ■
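
The "simple calculation" at the end of the proof can be spelled out as follows. This is a sketch of one way to fill in the step, assuming votes are counted with prefix length $\theta t$ throughout, $|\Delta(S_2, S)| \le \gamma t/8$ (so that $(1 - \gamma/8)t \le |S_2| \le (1 + \gamma/8)t$), and using $\alpha \le 1$ and $0 < \gamma \le 1$.

```latex
% Members of S pass the threshold used to define S_3:
\text{For } i \in S:\quad
v^{\theta t}_{S_2}(i) \;\ge\; \alpha t - |S \setminus S_2| \;\ge\; (\alpha - \gamma/8)\,t,
\qquad
(\alpha - \gamma/2)\,|S_2| \;\le\; (\alpha - \gamma/2)(1 + \gamma/8)\,t \;\le\; (\alpha - 3\gamma/8)\,t .

% Non-members fall strictly below it (recall \beta = \alpha - \gamma):
\text{For } j \notin S:\quad
v^{\theta t}_{S_2}(j) \;\le\; \beta t + |S_2 \setminus S| \;\le\; (\alpha - 7\gamma/8)\,t,
\qquad
(\alpha - \gamma/2)\,|S_2| \;\ge\; (\alpha - \gamma/2)(1 - \gamma/8)\,t \;\ge\; (\alpha - 5\gamma/8)\,t .
```

Hence every $i \in S$ satisfies $v^{\theta t}_{S_2}(i) \ge (\alpha - \gamma/2)|S_2|$ and every $j \notin S$ satisfies $v^{\theta t}_{S_2}(j) < (\alpha - \gamma/2)|S_2|$, which gives $S_3 = S$.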

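Corollary 1 The number of $(\theta, \alpha, \beta)$-self-determined communities in an affinity system $(V, \Pi)$ satisfies

$B(n) = n^{O(\log(1/\gamma)/\alpha)} \left(\frac{\theta\log(1/\gamma)}{\alpha}\right)^{O\left(\frac{1}{\gamma^2}\log\left(\frac{\theta\log(1/\gamma)}{\alpha\gamma}\right)\right)}$,

and with probability $\ge 1 - 1/n$ we can find all of them in time $B(n)\,\mathrm{poly}(n)$.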

We note that Theorem 1 and Corollary 1 apply even if some nodes do not list all members of $V$ in their
preference lists, and then some nodes in a community $S$ have fewer than $\theta|S|$ votes in total. If $\theta$, $\alpha$, and $\beta$
are constant, then Corollary 1 shows that the number of communities is $n^{O(\log(1/\gamma)/\alpha)}$, which is polynomial
in $n$, and they can be found in polynomial time. We can show that the dependence on $n^{1/\alpha}$ is necessary:

Theorem 2 For any constant $\theta \ge 1$, for any $\alpha \ge 2\sqrt{\theta}/n^{1/4}$, there exists an instance such that the number
of $(\theta, \alpha, \beta)$-self-determined communities with $\alpha - \beta = \gamma = \alpha/2$ is $n^{\Omega(1/\alpha)}$.

Proof Sketch: Consider $L = \sqrt{n}$ blobs $B_1, \ldots, B_L$, each of size $\sqrt{n}$. Assume that each point ranks the
points inside its blob first (in an arbitrary order) and it then ranks the points outside its blob randomly. One
can show that with non-zero probability for $l \le n^{1/4}/(2\sqrt{\theta})$ any union of $l$ blobs satisfies the
$(\theta, \alpha, \beta)$-self-stability property with parameters $\alpha = 1/l$ and $\gamma = \alpha/2$. Full details appear in Appendix A.1. ■

3.1 Self-determined Communities in Weighted Affinity Systems 

We provide here a simple efficient reduction from the weighted case to the non-weighted case. 

Theorem 3 Given a weighted affinity system $(V, A)$, $\theta$, $\alpha$, $\beta$, $\epsilon < \alpha$, and a community size $t$, there is an
efficient procedure that constructs a non-weighted instance $(V', \Pi)$ along with a mapping $f$ from $V'$ to $V$,
s.t. for any $(\theta, \alpha, \beta)$ community $S$ in $V$ there exists a $(\theta, \alpha - \epsilon, \beta)$ community $S'$ in $(V', \Pi)$ with $f(S') = S$.

Proof: Given the original weighted instance $(V, A)$, we construct a non-weighted instance $(V', \Pi)$ as follows.
For each $s \in V$, we create a blob $B_s$ of $k$ nodes in $V'$. For any $s, s' \in V$, if $p$ is the weight with
which $s$ votes for $s'$, we connect $B_s$ to $B_{s'}$ with $G_{k,\lfloor pk \rfloor}$, where $G_{k,\lfloor pk \rfloor}$ is a bipartite graph with $k$ nodes
on the left and $k$ nodes on the right such that each node on the left has out-degree $\lfloor pk \rfloor$ and each node on the
right has in-degree $\lfloor pk \rfloor$. Clearly all nodes in $V'$ rank at most $k|S|\theta$ other nodes (and do not have an opinion
about the rest). Let $k = 1/\epsilon$. Consider a community $S$ in $(V, A)$. For any $s \in S$ and for each node $i \in B_s$,
the total vote from nodes in $B_{s'}$ for $s' \in S$ (when evaluating whether $\bigcup_{s \in S} B_s$ is a good community or not)
is at least $\alpha|S|k - |S| \geq k|S|(\alpha - \epsilon)$. Moreover, for $s \notin S$ and for each node in $B_s$, the total vote
from the nodes in $B_{s'}$ for $s' \in S$ is at most $\beta|S|k$. Therefore $\bigcup_{s \in S} B_s$ is a legal $(\theta, \alpha - \epsilon, \beta)$-self-determined
community of size $kt$ in the non-weighted instance $(V', \Pi)$. ■
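
The blob construction in this proof can be sketched in a few lines. This is an illustration under assumptions, not the paper's procedure: the proof only requires some bipartite graph with the stated degrees, and the circulant pattern below is one convenient choice. The names `reduce_to_unweighted` and `f` are hypothetical, and the example weights are chosen so that `p * k` is an exact integer (floating-point rounding can otherwise change the floor).

```python
import math

def reduce_to_unweighted(weights, k):
    """weights[s][s2] is the weight in [0, 1] with which s votes for s2.

    Returns votes[(s, a)] = set of blob nodes that node a of blob B_s
    votes for. The circulant pattern gives every left node out-degree
    floor(p*k) and every right node in-degree floor(p*k).
    """
    votes = {(s, a): set() for s in weights for a in range(k)}
    for s, row in weights.items():
        for s2, p in row.items():
            d = math.floor(p * k)
            for a in range(k):
                for j in range(d):
                    votes[(s, a)].add((s2, (a + j) % k))
    return votes

def f(node):
    """Map a blob node back to the original element it was created from."""
    return node[0]

weights = {"u": {"u": 1.0, "v": 0.5}, "v": {"u": 0.25, "v": 1.0}}
votes = reduce_to_unweighted(weights, k=4)
print(sorted(votes[("u", 0)]))
```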

Using this reduction we immediately get the following result: 

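Theorem 4 For any $\theta$, $\alpha$, $\beta$, $\gamma = \alpha - \beta$, the number of weighted $(\theta, \alpha, \beta)$-self-determined communities is

$B(n) = (n/\gamma)^{O(\log(1/\gamma)/\alpha)} \left(\frac{\theta \log(1/\gamma)}{\alpha}\right)^{O\left(\frac{1}{\gamma^2} \log\left(\frac{\theta \log(1/\gamma)}{\alpha\gamma}\right)\right)}$

and we can find them in time $B(n)\,\mathrm{poly}(n)$.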

Proof: We perform the reduction in Theorem 3 with $\epsilon = \gamma/2$ and use the algorithm in Theorem 1 and the
bound in Corollary 1. The proof follows from the fact that the number of vertices in the new instance has
increased by only a factor of $2/\gamma$. We also note that each set output on the reduced instance can then be
examined on the original weighted affinity system, and kept iff it satisfies the community definition with the
original parameters. ■

3.2 Self-determined Communities in Multi-faceted Affinity Systems 

A multi-faceted affinity system is a system where each node may have more than one ranking of other
nodes. Suppose that each element $i$ is allowed to have at most $f$ different rankings $\pi_i^1, \ldots, \pi_i^f$. We say
that the pair $(S, \psi)$ is a multi-faceted community, where $\psi : S \to \{1, \ldots, f\}$, if $S$ is a community where
$\psi(i)$ specifies which ranking facet should be used by element $i$. In other words, as before, let

$\phi^{\theta}_{S,\psi}(i) := |\{s \in S \mid i \in \pi_s^{\psi(s)}(1 : \lceil \theta|S| \rceil)\}|$.

Then $(S, \psi)$ is an $(\alpha, \beta, \theta)$-multifaceted community if for all $i \in S$,
$\phi^{\theta}_{S,\psi}(i) \geq \alpha|S|$, and for all $j \notin S$, $\phi^{\theta}_{S,\psi}(j) < \beta|S|$.

We show that for a bounded $f$, even though there may be exponentially many functions $\psi$, it is not harder
to find multifaceted communities than to find regular communities. Note that all our sampling algorithms
can be adapted as follows. Once a representative sample $\{i_1, \ldots, i_k\}$ of the community $S$ is obtained, we
can guess the facets $\psi(i_1), \ldots, \psi(i_k)$ while adding a multiplicative $f^k$ factor to the running time. We can
thus get the set $S_2$ approximating $S$ in the same way as it is found in Algorithms 2 and 3 while adding
a multiplicative factor of $f^{k_1 + k_2}$ to the running time. We thus obtain a list $\mathcal{L}$ that for each multi-faceted
community $(S, \psi)$ contains a set $S_2$ such that $\Delta(S_2, S) < \gamma t/8$. Given $S_2$ we can output $S$ with probability
$\geq f^{-8\log n/\gamma^2}$: guess a set $U_2$ of $m = 8\log n/\gamma^2$ points in $S_2$; guess a function $\psi_2$ on $U_2$; output $S$ = the
set of points that receive at least $(\alpha - \gamma/2)t$ votes according to $(U_2, \psi_2)$. Moreover, a facet structure $\psi'$ can
be recovered on $S$ so that $(S, \psi')$ is an $(\alpha - \gamma/4, \beta + \gamma/4, \theta)$-multifaceted community using a combination
of linear programming and sampling. Details appear in Appendix A.3.
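
The multi-faceted community condition defined at the start of this subsection can be checked directly once a facet assignment is fixed. The sketch below is illustrative only: the data layout (`rankings[s][facet]` as a list with the most preferred element first) and the name `is_multifaceted_community` are assumptions, not notation from the paper.

```python
import math

def is_multifaceted_community(S, psi, rankings, universe, alpha, beta, theta):
    """Check phi(i) >= alpha|S| for i in S and phi(j) < beta|S| for j outside S.

    rankings[s][facet] is a list of elements, most preferred first;
    psi[s] is the facet that member s uses.
    """
    size = len(S)
    budget = math.ceil(theta * size)          # each member votes for its top ceil(theta|S|)
    phi = {x: 0 for x in universe}
    for s in S:
        for x in rankings[s][psi[s]][:budget]:
            phi[x] += 1
    inside = all(phi[i] >= alpha * size for i in S)
    outside = all(phi[j] < beta * size for j in universe if j not in S)
    return inside and outside

# Element 0 has two facets; only the second one points at its community {0, 1}.
rankings = {
    0: [[2, 3, 0, 1], [0, 1, 2, 3]],
    1: [[1, 0, 2, 3]],
    2: [[2, 3, 0, 1]],
    3: [[3, 2, 1, 0]],
}
S = {0, 1}
print(is_multifaceted_community(S, {0: 1, 1: 0}, rankings, range(4), 1.0, 0.5, 1.0))
print(is_multifaceted_community(S, {0: 0, 1: 0}, rankings, range(4), 1.0, 0.5, 1.0))
```

The same set $S$ passes or fails depending on $\psi$, which is why the algorithms above must guess the facets of the sampled points.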

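Theorem 5 Let $S$ be an $f$-faceted $(\alpha, \beta, \theta)$-community. Then there is an algorithm that runs in $O(n^2)$ time
and outputs $S$, as well as a facet structure $\psi'$ on $S$ such that $(S, \psi')$ is an $(\alpha - \gamma/4, \beta + \gamma/4, \theta)$-multifaceted
community, with probability

$\geq (f \cdot n)^{-O(\log(1/\gamma)/\alpha)} \left(\frac{f \cdot \theta \log(1/\gamma)}{\alpha}\right)^{-O\left(\frac{1}{\gamma^2} \log\left(\frac{\theta \log(1/\gamma)}{\alpha\gamma}\right)\right)} f^{-O(\log n/\gamma^2)}$.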

4 A Local Algorithm for Finding Self-determined Communities 

In this section we describe a local algorithm for finding a community. Given a single element $v$ and the
target community size $t$, the goal of the algorithm is to output a community $S$ of size $t$ containing $v$. Let us
fix a target community $S$ that we are trying to uncover this way.

We note that we need $\alpha > 1/2$ for a local algorithm that uses only one seed to succeed. If $\alpha \leq 1/2$ then one
may have a valid $(\theta, \alpha, \beta)$-community that is comprised of two disjoint cliques of vertices. In this case, no
local algorithm that starts with just one vertex as a seed can uncover both cliques. However, we can extend
the construction below if we start with $O(1/\alpha)$ seeds. Below, we focus on providing a local algorithm for
$\alpha > 1/2$. Our local algorithm will follow the structure of the generic Algorithm 1. The main technical
challenge is to provide a local procedure for producing rough approximations. In general, it is not possible
to do so starting from any seed vertex $v \in S$. For example, if $v$ is a super-popular vertex that is voted
first by everyone in $V$, then $v$ will belong to all communities including $S$, but $v$ would contain no "special
information" that would allow one to identify $S$. However, we will show that a constant fraction of the
nodes in $S$ are sufficiently "representative" of $S$ to enable one to recover $S$.

Let us fix $t$ and $\theta$. For an element $v$, we let $R(v)$ be a uniformly random element which receives $v$'s vote
with these parameters. In other words, $R(v) :=$ uniform element of $\pi_v(1 : \theta t)$. We start with the main
technical claim that enables a local procedure for producing rough approximations.

Lemma 3 Let $S$ be any $(\theta, \alpha, \beta)$-community of size $t$. Let $\eta := 2\alpha - 1 > 0$. Then there is a subset $T \subseteq S$
such that $|T| \geq \eta t$ and for each pair $v \in T$ and $u \in S$, we have $\Pr[R(R(v)) = u] \geq \frac{\alpha - 1/2}{\theta^2 t}$.

Proof: For each element $v \in S$ denote by $O_S(v) := \pi_v(1 : \theta t) \cap S$ the elements of $S$ that $v$ votes for,
and by $I_S(v) := \{u \in S : v \in \pi_u(1 : \theta t)\}$ the elements of $S$ that vote for $v$. By the community property
we know that $|I_S(v)| \geq \alpha t$ for all $v \in S$. Observe that

$\sum_{v \in S} |O_S(v)| = \sum_{v \in S} |I_S(v)| \geq \alpha t^2$.

Hence at least an $\eta$-fraction of $v$'s in $S$ must satisfy $|O_S(v)| \geq t/2$, where $\eta = 2\alpha - 1$. Let $T :=
\{v : |O_S(v)| \geq t/2\} \subseteq S$. For any $v \in T$ and any $u \in S$, we have

$|O_S(v) \cap I_S(u)| \geq |O_S(v)| + |I_S(u)| - t \geq (\alpha - 1/2) \cdot t$.

To finish the proof note that

$\Pr[R(R(v)) = u] \geq \Pr[R(v) \in O_S(v) \cap I_S(u)] \cdot \frac{1}{\theta t} \geq \frac{(\alpha - 1/2)t}{\theta t} \cdot \frac{1}{\theta t} = \frac{\alpha - 1/2}{\theta^2 t}$.

■

We call any vertex $v$ in the set $T$ in Lemma 3 a "good seed vertex" for $S$. Lemma 3 suggests a natural
procedure (Algorithm 4) for generating a rough approximation in a local way given a good seed vertex.

Algorithm 4 Generate rough approximations

Input: Preference system $(V, \Pi)$, information $I$ (parameters $\theta$, $\alpha$, $\beta$, $\gamma$, vertex $v$, size $t$).

• Set $S_1 = \left\{u : \Pr[u = R(R(v))] \geq \frac{\alpha - 1/2}{\theta^2 t}\right\}$.

Output: List of sets $\mathcal{L} = \{S_1\}$.
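
For small instances the set $S_1$ of Algorithm 4 can be computed exactly rather than estimated. The following is a minimal sketch under stated assumptions: `top_votes[x]` is assumed to hold $\pi_x(1 : \theta t)$ for the fixed $t$ and $\theta$, the threshold is the one from Lemma 3, and the function names are hypothetical.

```python
from fractions import Fraction

def two_step_distribution(v, top_votes):
    """Exact distribution of R(R(v)), where R(x) is uniform over top_votes[x]."""
    first = top_votes[v]
    dist = {}
    for w in first:
        second = top_votes[w]
        for u in second:
            step = Fraction(1, len(first) * len(second))
            dist[u] = dist.get(u, Fraction(0)) + step
    return dist

def rough_approximation(v, top_votes, alpha, theta, t):
    """S_1 = {u : Pr[u = R(R(v))] >= (alpha - 1/2) / (theta^2 * t)}."""
    threshold = (Fraction(alpha) - Fraction(1, 2)) / (Fraction(theta) ** 2 * t)
    dist = two_step_distribution(v, top_votes)
    return {u for u, pr in dist.items() if pr >= threshold}

# Two disjoint groups of size t = 3 whose members vote only inside their group.
top_votes = {0: [0, 1, 2], 1: [0, 1, 2], 2: [0, 1, 2],
             3: [3, 4, 5], 4: [3, 4, 5], 5: [3, 4, 5]}
S1 = rough_approximation(0, top_votes, alpha=1, theta=1, t=3)
print(sorted(S1))
```

Because the two-step probabilities sum to 1, at most $\theta^2 t/(\alpha - 1/2)$ elements can clear the threshold, which is the size bound used in the analysis that follows.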
Theorem 6 Assume $\alpha > 1/2$. Let $k_2(\theta, \alpha, \gamma) = O\left(\frac{1}{\gamma^2} \log \frac{\theta}{\gamma(\alpha - 1/2)}\right)$ and $N_2(\theta, \alpha, \gamma) = \left(\frac{\theta^2}{\alpha - 1/2}\right)^{k_2} \log(1/\delta)$.
Assuming $v$ is a good seed element for a community $S$, then by using Algorithm 4 together with Algorithm 3
for steps (1) and (2) of Algorithm 1 we have that with probability $\geq 1 - \delta$ we will output $S$.

Proof: It is enough to show that each iteration of the purification algorithm (Algorithm 3) has probability
$\geq \left(\frac{\alpha - 1/2}{\theta^2}\right)^{k_2}$ of outputting $S$. Since $v$ is a good seed element of $S$, the set $S_1$ produced by Algorithm 4 must
contain $S$. Because the probabilities $\Pr[u = R(R(v))]$ sum to 1, it is easy to see that $|S_1| \leq t\theta^2/(\alpha - 1/2)$.
Thus, applying Lemma 2 with $M = \theta^2/(\alpha - 1/2)$
we see that if the points of $U_2$ are drawn uniformly from $S$, then with high probability $S_2$ is $\gamma/8$-close to $S$,
and $S_3 = S$. Since, conditioned on $U_2 \subseteq S$, $U_2$ is uniform in $S$, our probability of success is given by the
probability that $U_2 \subseteq S$, which is equal to $\left(\frac{|S|}{|S_1|}\right)^{k_2} \geq \left(\frac{\alpha - 1/2}{\theta^2}\right)^{k_2}$, which completes the proof. ■

Note that when $\alpha > 1/2$, $\beta$, and $\theta$ are constants, the purification procedure will run in a constant number of
iterations. Our main result of this section is the following:

Theorem 7 Suppose $\alpha > 1/2$. Assume $\alpha$, $\beta$, $\theta$, and $\delta$ are constants. If $v$ is chosen uniformly at random
from $S$, then with probability at least $(2\alpha - 1)(1 - \delta)$ we can find $S$ in time $O(t \log t)$.

Proof: First, by Lemma 3, with probability at least $2\alpha - 1$, element $v$ is such that for all $u \in S$, we
have $\Pr[R(R(v)) = u] \geq \frac{\alpha - 1/2}{\theta^2 t}$. We now implement Algorithm 4 by performing $\frac{8\theta^2 t}{\alpha - 1/2} \log(2t/\delta)$
random draws from $R(R(v))$ and letting $S_1$ be the set of points $u$ hit at least $4\log(2t/\delta)$ times. By Chernoff
bounds, for each $u \in S$, we have included $u$ in $S_1$ with probability at least $1 - e^{-8\log(2t/\delta)/8} = 1 -
\delta/(2t)$, so with probability at least $1 - \delta/2$ we have $S_1 \supseteq S$. Furthermore, since we only include points
hit at least $4\log(2t/\delta)$ times, we have $|S_1| \leq \frac{2\theta^2 t}{\alpha - 1/2}$. Thus, the analysis in Theorem 6 implies that
the purification step (Algorithm 3) will succeed with probability at least $1 - \delta/2$ for a choice of
$N_2 = \left(\frac{2\theta^2}{\alpha - 1/2}\right)^{k_2} \log(2/\delta)$. Putting these together yields the desired success probability. Furthermore, since
$\alpha$, $\beta$, $\theta$, $\delta$ are constants, the overall time is $O(t \log t)$. ■
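
The sampling implementation of Algorithm 4 described in this proof can be sketched as follows. This is a hedged illustration: it assumes `top_votes[x]` stores $\pi_x(1 : \theta t)$, uses natural logarithms, and takes the number of draws and the hit threshold directly from the proof; the function name is hypothetical.

```python
import math
import random

def sample_rough_approximation(v, top_votes, alpha, theta, t, delta, rng):
    """Estimate S_1 by repeated draws of R(R(v)).

    Uses (8*theta^2*t/(alpha - 1/2)) * log(2t/delta) draws and keeps the
    points that are hit at least 4*log(2t/delta) times.
    """
    log_term = math.log(2 * t / delta)
    draws = math.ceil(8 * theta ** 2 * t / (alpha - 0.5) * log_term)
    hits = {}
    for _ in range(draws):
        w = rng.choice(top_votes[v])   # w = R(v)
        u = rng.choice(top_votes[w])   # u = R(R(v))
        hits[u] = hits.get(u, 0) + 1
    return {u for u, count in hits.items() if count >= 4 * log_term}

top_votes = {0: [0, 1, 2], 1: [0, 1, 2], 2: [0, 1, 2],
             3: [3, 4, 5], 4: [3, 4, 5], 5: [3, 4, 5]}
S1 = sample_rough_approximation(0, top_votes, alpha=1.0, theta=1.0, t=3,
                                delta=0.1, rng=random.Random(0))
print(sorted(S1))
```

Each draw touches only two preference lists, so the work is proportional to the number of draws, which is $O(t \log t)$ when the parameters are constants and does not depend on $|V|$.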

It is not hard to see that the algorithm in Theorem 6 will work even if $t$ is given to it only up to some small
multiplicative error. As a corollary of Theorem 6 we see that the number of communities is actually linear
and we can find all of them in quasilinear time.

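Theorem 8 Suppose that $\alpha > 1/2$. The total number of $(\theta, \alpha, \beta)$-self-determined communities is bounded
by $O\left(\frac{n}{\min(\gamma,\, \alpha - 1/2) \cdot (\alpha - 1/2)} \left(\frac{\theta^2}{\alpha - 1/2}\right)^{k_2(\theta, \alpha, \gamma)}\right)$, which is $O(n)$ if $\alpha$, $\beta$, and $\theta$ are constants.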

Proof: It is easy to see that executing the algorithm in Theorem 7, where we only do one iteration of the
purification step (i.e., of Algorithm 3), with inputs $t' \in ((1 - \epsilon)t, (1 + \epsilon)t)$, $\alpha' = \alpha - 4\epsilon$, $\beta' = \beta + 4\epsilon$,
$\theta' = \theta(1 + \epsilon)$, and an appropriate seed vertex $v \in S$ will lead to the discovery of a $(\theta, \alpha, \beta)$-community $S$
of size $|S| = t$ with probability $\geq p := \left(\frac{\alpha - 1/2}{\theta^2}\right)^{k_2(\theta, \alpha, \gamma)}$, as long as $\epsilon$ is sufficiently small. Here it is
enough to take $\epsilon = \min(\gamma, \alpha - 1/2)/100$. Thus a pair $(v, t')$, where $v$ is a vertex and $t'$ is the target size,
corresponds to at most $1/p$ distinct communities. Moreover, each community $S$ of size $t$ corresponds to
more than $t(2\alpha - 1)/2$ such pairs. Since $t'$ needs only to be within a multiplicative $(1 + \epsilon)$ factor of $t$, we can
always select $t'$ from the set of values $\{(1 + \epsilon)^i : i = 0, 1, \ldots, \lceil \log_{1+\epsilon} n \rceil\}$. For each value $t'$, the number of
communities of size between $t'$ and $t'(1 + \epsilon)$ is thus bounded by the number of possible pairs $(t', v)$ ($= n$),
times $1/p$ and divided by $t'(2\alpha - 1)/2$:

$\#\{\text{communities of size between } t' \text{ and } t'(1 + \epsilon)\} \leq n \cdot \frac{1}{p} \cdot \frac{1}{t'(2\alpha - 1)/2}$.

Summing this geometric series over the possible values of $t'$ we obtain the upper bound

$O\left(\frac{n}{\epsilon(2\alpha - 1)} \left(\frac{\theta^2}{\alpha - 1/2}\right)^{k_2(\theta, \alpha, \gamma)}\right)$,

which leads to the bound in the statement of the theorem. ■

Note: We can extend our local approach to weighted and multi-faceted affinity systems. See Appendix A.4.

4.1 An Alternative Non-local Algorithm

The analysis in this section suggests an alternative way of generating rough approximations in the non-local
model, which leads to an algorithm that provides asymptotically better bounds than Theorem 1 in interesting
cases, in particular when $\theta$, $\alpha$, and $\gamma$ are constants and there is a large gap between $\alpha$ and $\gamma$. This leads to
an improved polynomial bound of $n^{O(\log(1/\alpha)/\alpha)}$ on the number of $(\theta, \alpha, \beta)$-self-determined communities
when $\theta$, $\alpha$, and $\gamma$ are constants, using Algorithm 5.

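Algorithm 5 Generate rough approximations

Input: Preference system $(V, \Pi)$, information $I$ (parameters $\theta$, $\alpha$, $\beta$, size $t$).

• Set $\mathcal{L} = \emptyset$; $\gamma = \alpha - \beta$.

• Exhaustively search over all subsets $U_0$ of $V$ of size $\lceil \log(1/\alpha)/\alpha \rceil + 1$; for each $U_0$ add to the list $\mathcal{L}$ the set

$S_1 := \left\{x : \sum_{y \in U_0} \Pr[x = R(R(y))] \geq \frac{\alpha}{2\theta^2 t}\right\}$.

Output: List of sets $\mathcal{L}$.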



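Theorem 9 Fix a $(\theta, \alpha, \beta)$-self-determined community $S$. Let $\gamma = \alpha - \beta$, $k_1(\theta, \alpha, \gamma) = O(\log(1/\alpha)/\alpha)$,
$k_2(\theta, \alpha, \gamma) = O\left(\frac{1}{\gamma^2} \log\left(\frac{\theta}{\alpha\gamma}\right)\right)$, $N_2(\theta, \alpha, \gamma) = O\left((\theta^2/\alpha^3)^{k_2} \log(1/\delta)\right)$. Using Algorithm 5 together with
Algorithm 3 for steps (1) and (2) of Algorithm 1, with probability $\geq 1 - \delta$ one of the elements in the
list $\mathcal{L}$ we output is identical to $S$.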



PROOF Sketch: By using a reasoning similar to the one in Lemma 2 we can show that there exists a set
$U_0$ of $\lceil \log(1/\alpha)/\alpha \rceil + 1$ points such that the subset $U_1$ of points voted for by at least one member of $U_0$ contains
at least a $1 - \alpha/2$ fraction of $S$. We show in the following that the corresponding set $S_1$ indeed covers $S$. Fix a
vertex $x \in S$. We need to show that

$\sum_{y \in U_0} \Pr[x = R(R(y))] \geq \frac{\alpha}{2\theta^2 t}$.

Let $Q \subseteq S$ be the set of elements that vote for $x$. We know that $|Q| \geq \alpha t$, since $x \in S$. Thus

$|U_1 \cap Q| \geq |U_1 \cap S| + |Q| - |S| \geq \alpha t/2$.

Each $z \in U_1 \cap Q$ contributes at least $1/(\theta^2 t^2)$ to the sum $\sum_{y \in U_0} \Pr[x = R(R(y))]$. Thus this sum is at least
$(\alpha t/2) \cdot (1/(\theta^2 t^2)) = \alpha/(2\theta^2 t)$. Hence $x \in S_1$, as required. Moreover, by observing that

$\sum_{x \in V} \sum_{y \in U_0} \Pr[x = R(R(y))] = \sum_{y \in U_0} \sum_{x \in V} \Pr[x = R(R(y))] = |U_0| \leq 1/\alpha^2$,

we obtain $|S_1| \leq 2\theta^2 t/\alpha^3$.

Since when running Algorithm 5 we exhaustively search over all subsets $U_0$ of $V$ of size $k_1(\theta, \alpha, \gamma)$, in
one of the rounds we find a set $U_0$ such that $|S_1| \leq 2\theta^2 t/\alpha^3$ and $S \subseteq S_1$. So $\mathcal{L}$ contains a rough approximation to $S$.
Finally, using a reasoning similar to the one in Theorem 1 we get the desired conclusion. ■

Theorem 9 gives asymptotically better bounds than Theorem 1 when $N_1 = n^{k_1(\theta, \alpha, \gamma)}$ is the dominant term
in the bound (e.g., when $\theta$, $\alpha$, and $\gamma$ are constants), and especially when there is a large gap between $\alpha$ and
$\gamma$, since $k_1$ is reduced from $\log(16/\gamma)/\alpha$ to $\lceil \log(1/\alpha)/\alpha \rceil + 1$. On the other hand, Theorem 9 has worse
dependence on $\theta$ and $\alpha$ in $N_2$, so for certain parameter settings Theorem 1 can be preferable, especially if
one optimizes the constants in Lemmas 1 and 2 based on the given parameters.



5 Self-determined Communities in Social Networks 

In this section we present a natural notion of self-determined communities in social networks and discuss
how our analysis sheds light on the notion of $(\alpha, \beta)$-clusters [16, 17, 12]. We assume that the input is a
directed graph $G = (V, E)$ and for a vertex $i$ we denote by $d_i$ its out-degree. As discussed in Section 1,
given a social network we can consider the affinity system induced by direct lifting and then consider
self-determined communities in that affinity system. This leads to the following very natural notion:

Definition 3 Let $G = (V, E)$ be a directed graph and let $\theta, \alpha, \beta \geq 0$ with $\beta < \alpha \leq 1$. Consider the affinity
system $(V, a_1, \ldots, a_n)$ where $a_{ij} = w_{ij}$ if $(i, j) \in E$ and $a_{ij} = 0$ otherwise. A subset $S \subseteq V$ is a $(\theta, \alpha, \beta)$
self-determined community in $G$ if it is a $(\theta, \alpha, \beta)$ weighted self-determined community in $(V, a_1, \ldots, a_n)$.

Note that when evaluating a community of size $t$ each node $i$ is allowed a total vote of at most $\theta t$. One
natural way to achieve this is to only fractionally count edges from high-degree nodes $i$, giving them weight
$\min(\theta t/d_i, 1)$ when evaluating a community of size $t$ in the induced weighted affinity system.
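
The fractional edge weighting can be made concrete with a short sketch. It is an illustration under assumptions rather than the paper's definition in full: every edge of node $i$ is given weight $\min(\theta t/d_i, 1)$, and a set $S$ of size $t$ is accepted when each member receives total weight at least $\alpha t$ from $S$ and each outsider receives less than $\beta t$. The names `lifted_weights` and `is_weighted_community` are hypothetical.

```python
def lifted_weights(adj, theta, t):
    """adj[i] is the set of out-neighbours of i.

    Each edge (i, j) gets weight min(theta*t/d_i, 1), so node i casts a
    total vote of at most theta*t when a community of size t is evaluated.
    """
    weights = {}
    for i, nbrs in adj.items():
        d = len(nbrs)
        weights[i] = {j: min(theta * t / d, 1.0) for j in nbrs} if d else {}
    return weights

def is_weighted_community(S, adj, alpha, beta, theta):
    """Check the inside/outside vote conditions for S in the lifted system."""
    t = len(S)
    w = lifted_weights(adj, theta, t)
    received = {x: 0.0 for x in adj}
    for s in S:
        for j, wt in w[s].items():
            received[j] += wt
    inside = all(received[i] >= alpha * t for i in S)
    outside = all(received[j] < beta * t for j in adj if j not in S)
    return inside and outside

# Nodes 0-2 form a tight group; node 3 is a hub linked to everyone.
adj = {0: {0, 1, 2}, 1: {0, 1, 2}, 2: {0, 1, 2}, 3: {0, 1, 2, 3, 4}, 4: {3, 4}}
print(lifted_weights(adj, theta=1.0, t=3)[3][0])
print(is_weighted_community({0, 1, 2}, adj, alpha=1.0, beta=0.5, theta=1.0))
```

The hub's edges are down-weighted because its degree exceeds $\theta t$, so a node that links to everyone cannot vouch at full strength for every small set.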

The community notion introduced in [16, 17] is as follows:

Definition 4 Let $\alpha$, $\beta$ with $\beta < \alpha \leq 1$ be two positive parameters. Given an undirected graph $G = (V, E)$,
where every vertex has a self-loop, a subset $S \subseteq V$ is an $(\alpha, \beta)$-cluster if $S$ is:

(1) Internally Dense: $\forall i \in S$, $|E(i, S)| \geq \alpha|S|$.

(2) Externally Sparse: $\forall i \notin S$, $|E(i, S)| \leq \beta|S|$.
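
The two conditions of Definition 4 translate directly into a checker. The sketch below assumes an adjacency-set representation in which each vertex is its own neighbour (the self-loop); the helper names are hypothetical.

```python
def with_self_loops(edges, n):
    """Build adjacency sets for an undirected graph on vertices 0..n-1,
    adding the self-loop that Definition 4 assumes at every vertex."""
    adj = {v: {v} for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def is_alpha_beta_cluster(adj, S, alpha, beta):
    """Internally dense: every i in S has at least alpha|S| neighbours in S.
    Externally sparse: every i outside S has at most beta|S| neighbours in S."""
    size = len(S)
    dense = all(len(adj[i] & S) >= alpha * size for i in S)
    sparse = all(len(adj[j] & S) <= beta * size for j in adj if j not in S)
    return dense and sparse

# A triangle {0, 1, 2} with a pendant vertex 3 attached to vertex 2.
adj = with_self_loops([(0, 1), (0, 2), (1, 2), (2, 3)], n=4)
print(is_alpha_beta_cluster(adj, {0, 1, 2}, alpha=1.0, beta=0.5))
print(is_alpha_beta_cluster(adj, {0, 1, 2}, alpha=1.0, beta=0.25))
```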

The $(\alpha, \beta)$-cluster notion resembles our community notion in Definition 3. In particular, in the case where
the graph is undirected, Definition 3 is similar to Definition 4 except that in the case of our Definition 3
each node $i$ is allowed a total vote of at most $\theta t$. As discussed above, one way to achieve this is to only
fractionally count edges from high-degree nodes $i$, giving them weight $\min(\theta|S|/d_i, 1)$. This distinction is
crucial for getting polynomial time algorithms. From our results in the previous sections we have that every
graph has only a polynomial number of communities satisfying Definition 3 and, moreover, we can find all
of them in polynomial time. In contrast, as we show, there exist graphs with a superpolynomial number of
$(\alpha, \beta)$-clusters.

Theorem 10 For any constant $\epsilon$, $\alpha = 1$, $\alpha - \beta = 1/2 - \epsilon$, there exist instances with $n^{\Omega(\log n)}$ $(\alpha, \beta)$-clusters.

Proof: Consider the graph $G_{n,p}$ with $p = 1/2^{\ell}$. Consider all $\binom{n}{k}$ sets of size $k = \frac{2\log_2 n}{\ell}(1 - \delta)$, where $\delta$ is a
constant (determined later). For each such set $S$, the probability it is a clique is

$p^{\binom{k}{2}} \geq (1/2)^{\ell k^2/2} = (1/2)^{2\log_2^2 n\,(1-\delta)^2/\ell} = n^{-k(1-\delta)}$.

We now want to show that conditioned on $S$ being a clique, it is also an $(\alpha, \beta)$-cluster with probability at
least $1/2$. This will imply that the expected number of $(\alpha, \beta)$-clusters is at least

$\frac{1}{2}\binom{n}{k} n^{-k(1-\delta)} = n^{\Omega(\log n)}$.

Fix such a set of size $k = \frac{2\log_2 n}{\ell}(1 - \delta)$. The probability that a node outside is connected to more than a
$(1/2 + \epsilon)$-fraction of the set is upper bounded by

$\binom{k}{(1/2+\epsilon)k}\, p^{(1/2+\epsilon)k} \leq 2^k \cdot 2^{-\ell(1/2+\epsilon)k} \leq n^{2/\ell - (1+\epsilon)(1-\delta)}$.

By imposing $\frac{2}{\ell} - (1 + \epsilon)(1 - \delta) \leq -(1 + \log_n 2)$, we get that this probability is upper bounded by $1/(2n)$.
So by a union bound over all nodes we then get the desired result. The condition we need to impose, equivalently
$(1 + \epsilon)(1 - \delta) - \frac{2}{\ell} \geq 1 + \log_n 2$, is true for $\delta < \epsilon/4$, $\ell \geq 12/\epsilon$, and $n$ large enough. ■

We note that for a certain range of parameters our bound in Theorem 13 improves over the general upper
bound given in [16, 17]. Moreover, we show that even in graphs with only one $(\alpha, \beta)$-cluster,
finding this cluster is at least as hard as solving the planted clique problem for planted cliques of size $O(\log n)$,
which is believed to be hard (see, e.g., Hazan and Krauthgamer [11]).

The Hidden Clique Problem: In this problem, the input is a graph on $n$ vertices drawn at random from the
following distribution $G_{n,1/2,k}$: pick a random graph from $G_{n,1/2}$ and plant in it a clique of size $k = k(n)$.
The goal is to recover the planted clique (in polynomial time), with probability at least (say) $1/2$ over the
input distribution. The clique is hidden in the sense that its location is adversarial and not known to the
algorithm. The hidden clique problem becomes only easier as $k$ gets larger, and the best polynomial-time
algorithm to date [1] solves the problem whenever $k = \Omega(\sqrt{n})$. Finding a hidden clique for $k = c \log n$ for
any $c$ is believed to be hard. The decision version of this problem is also believed to be hard.

We begin with a simpler result: finding the approximately-largest $(\alpha, \beta)$-cluster is at least as hard as the
hidden clique problem.

Theorem 11 Suppose that for $\alpha = 1$ and $\alpha - \beta = 1/4$, there was an algorithm that for some constant
$c$ could find an $(\alpha, \beta)$-cluster of size at least MAX$/c$, where MAX is the size of the largest community with
those parameters. Then, that algorithm could be used to distinguish (1) a random graph $G_{n,1/2}$ from (2) a
random graph $G_{n,1/2}$ in which a clique of size $2c \log_2(n)$ has been planted.

Proof: We can show that with probability at least $1 - 1/n$ the largest clique in $G_{n,1/2}$ has size at most
$2\log_2(n)$, which implies the largest $(\alpha, \beta)$ cluster (with $\alpha = 1$ and $\alpha - \beta = 1/4$) has size at most $2\log_2(n)$.
On the other hand we can also show that with probability at least $1 - 1/n$, for $c \geq 8 \ln 2$ the planted clique
of size $2c \log_2(n)$ is a cluster with these parameters. Thus, under the assumption that distinguishing these
two cases is hard, the problem of finding the approximately-largest $(\alpha, \beta)$-cluster is hard. ■
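
The threshold $c \geq 8 \ln 2$ in this proof can be recovered from Hoeffding's inequality. The derivation below is a sketch that assumes $\alpha = 1$ and $\beta = 3/4$, writes $K$ for the planted clique with $k = |K| = 2c\log_2 n$, and considers a vertex $v \notin K$, whose expected number of neighbours in $K$ is $k/2$:

```latex
\Pr\Big[\,|E(v, K)| > \tfrac{3}{4}k\,\Big]
  \;\le\; \exp\!\Big(-2k\big(\tfrac{1}{4}\big)^{2}\Big)
  \;=\; \exp(-k/8)
  \;=\; \exp\!\Big(-\frac{c \ln n}{4 \ln 2}\Big)
  \;=\; n^{-c/(4 \ln 2)}
  \;\le\; n^{-2}
  \qquad \text{whenever } c \ge 8 \ln 2 .
```

A union bound over at most $n$ outside vertices then gives failure probability at most $1/n$, and every clique vertex is adjacent to all of $K$, so the internal density condition with $\alpha = 1$ holds automatically.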

We now show that, in fact, even finding a single $(\alpha, \beta)$-cluster is as hard as the hidden clique problem. Here,
instead of $G_{n,1/2}$ we will use $G_{n,p}$ for constant $p > 1/2$. Note that the hidden clique problem remains hard
in this setting as well (see footnote 3).

Theorem 12 For sufficiently small (constant) $\gamma$ and $\epsilon$, with probability at least $1 - 3/n$, we have that: (1)
the graph $G_{n,1-\gamma-\epsilon}$ has no $(1, 1 - \gamma)$ clusters; and (2) a hidden clique of size $\frac{1}{\epsilon^2} \log n$ is a $(1, 1 - \gamma)$ cluster.
Therefore, finding even one such cluster is as hard as the hidden clique problem.

Proof: Consider $G_{n,p}$ for $p = 1 - \gamma - \epsilon$. We start by showing that with probability at least $1 - 1/n$ the size
of the largest clique is at most $\frac{-2\ln n}{\ln(1-\gamma-\epsilon)}$. For any $k$, the probability that there exists a clique of size $k$ is at
most

$\binom{n}{k} p^{\binom{k}{2}} \leq \frac{n^k}{k!}\, p^{k^2/2}\, p^{-k/2}$.

For $k = \frac{-2\ln n}{\ln(1-\gamma-\epsilon)} = -2\log_p n$, this is

$\frac{n^{-2\log_p n}\, p^{2(\log_p n)^2}\, p^{-k/2}}{k!} = \frac{p^{-k/2}}{k!} = \frac{n}{k!} = o(1/n)$.

This immediately implies that with probability at least $1 - 1/n$, $G_{n,p}$ does not contain any $(1, 1 - \gamma)$ clusters
of size greater than $\frac{-2\ln n}{\ln(1-\gamma-\epsilon)}$.

We now show that with probability at least $1 - 1/n$, $G_{n,p}$ does not contain any $(1, 1 - \gamma)$ clusters of size
at most $\frac{-2\ln n}{\ln(1-\gamma-\epsilon)}$. For this, we will show that for any set $S$ of size at most $\frac{-2\ln n}{\ln(1-\gamma-\epsilon)}$ and any node $v$ not in $S$ the
probability that $v$ connects to at least $(1 - \gamma)|S|$ nodes inside $S$ is at least $1/\sqrt{n}$. Because these events are
independent over the different nodes $v$, this implies that the probability that no node $v$ outside $S$ connects
to at least $(1 - \gamma)|S|$ nodes inside $S$ is at most $\left(1 - \frac{1}{\sqrt{n}}\right)^{n - |S|} \leq e^{-\sqrt{n}/2}$. By a union bound over all sets $S$
of size at most $\frac{-2\ln n}{\ln(1-\gamma-\epsilon)}$, this will imply that the probability there exists a $(1, 1 - \gamma)$ cluster of size at most
$\frac{-2\ln n}{\ln(1-\gamma-\epsilon)}$ is at most $1/n$.

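Consider a set $S$ of size $k$ and a node $v$ outside $S$. The probability that $v$ connects to more than $(1 - \gamma)k$
nodes inside $S$ is at least

$\binom{k}{\gamma k} (1 - \gamma - \epsilon)^{(1-\gamma)k} (\gamma + \epsilon)^{\gamma k} \geq \frac{1}{k} \left(\frac{(1-\gamma)e}{\gamma}\right)^{\gamma k} (1 - \gamma - \epsilon)^{(1-\gamma)k} (\gamma + \epsilon)^{\gamma k}$.

This follows from the fact that

$\binom{k}{\gamma k} = \frac{k(k-1)\cdots(k - \gamma k + 1)}{(\gamma k)!} \geq \frac{((1-\gamma)k)^{\gamma k}}{k(\gamma k/e)^{\gamma k}} = \frac{1}{k} \left(\frac{(1-\gamma)e}{\gamma}\right)^{\gamma k}$,

where we use the fact that $(\gamma k)! \leq 2\sqrt{2\pi k}\,(\gamma k/e)^{\gamma k} \leq k(\gamma k/e)^{\gamma k}$.

So, the probability that $v$ connects to more than $(1 - \gamma)k$ nodes inside $S$ is at least

$\frac{1}{k} (1 - \gamma - \epsilon)^{k} \left(\frac{1-\gamma}{\gamma} \cdot \frac{\gamma + \epsilon}{1 - \gamma - \epsilon}\right)^{\gamma k} e^{\gamma k} \geq \frac{1}{k} \left[(1 - \gamma - \epsilon)\, e^{\gamma}\right]^{k}$.

This is decreasing with $k$ and thus it suffices to consider $k = \frac{-2\ln n}{\ln(1-\gamma-\epsilon)}$. For this $k$, we get that the probability
that $v$ connects to more than $(1 - \gamma)k$ nodes inside $S$ is at least

$\frac{1}{k}\, e^{-2\ln n}\, e^{\frac{-2\gamma \ln n}{\ln(1-\gamma-\epsilon)}} = \frac{1}{k}\, n^{-2 - \frac{2\gamma}{\ln(1-\gamma-\epsilon)}}$.

We want this to be greater than $1/\sqrt{n}$, and thus it suffices to have $-2 - \frac{2\gamma}{\ln(1-\gamma-\epsilon)} \geq -0.4$. This holds for
$\gamma = 0.1$, $\epsilon = 0.01$.

Footnote 3: In particular, if it were easy, then one could solve the decision version of the hidden clique problem for $G_{n,1/2}$ by first adding
additional random edges and then solving the problem for $G_{n,p}$. We assume here that the planted clique has size greater than the
largest clique that would be found in $G(n, p)$.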

Finally, it is easy to show that with probability at least $1 - 1/n$, a hidden clique of size $k = \frac{1}{\epsilon^2} \ln n$
is a $(1, 1 - \gamma)$ cluster. This follows by noticing that every vertex outside the clique has in expectation
$k(1 - \gamma - \epsilon)$ connections inside the clique, so by Hoeffding bounds, the probability it has more than
$k(1 - \gamma - \epsilon) + \epsilon k = k(1 - \gamma)$ neighbors inside the clique is at most $1/n^2$. By a union bound, we get that with
probability at least $1 - 1/n$ every vertex outside the clique has at most $k(1 - \gamma)$ neighbors inside the clique,
so the planted clique is a community as desired. ■

A Additional Proofs

A.l Finding Self-determined Communities in Quasi-Polynomial Time 

We present here a simple quasi-polynomial algorithm for enumerating all the self-determined communities.

Theorem 13 For any $\theta$, $\alpha$, $\beta$, $\gamma = \alpha - \beta$, there are $n^{O(\log n/\gamma^2)}$ sets which are $(\theta, \alpha, \beta)$ (weighted)
self-determined communities. All such communities can be found by using Algorithm 6 with parameters $\theta$, $\alpha$, $\beta$,
$\gamma = \alpha - \beta$ and $k(\gamma) = 2\log(4n)/\gamma^2$.

Proof: Fix a $(\theta, \alpha, \beta)$ (weighted) self-determined community $S$. We show that there exists a multiset $U$ of
size $k(\gamma) = 2\log(4n)/\gamma^2$ such that the set $S_U$ of points in $V$ that receive at least $(\alpha - \gamma/2)|U|$ amount of
vote from points in $U$ is identical to $S$. The proof follows simply by the probabilistic method. Let us fix
a point $i \in V$. By Hoeffding, if we draw a set $U$ of $2\log(4n)/\gamma^2$ points uniformly at random from $S$, then with
probability $1 - 1/(2n)$, the average amount of vote that $i$ receives from points in $U$ is within $\gamma/2$ of the
average amount of vote that $i$ receives from points in $S$. By a union bound, we get that with probability at
least $1/2$, for all points in $V$ the average amount of vote that they receive from points in $U$ is within $\gamma/2$ of
the average amount of vote that they receive from points in $S$. Using this together with the definition of a
self-determined community, we get that with probability $1/2$ we obtain $S_U = S$ for $U$ of size $2\log(4n)/\gamma^2$
drawn uniformly at random from $S$. This then implies that there must exist a multiset $U$ of size $k(\gamma)$ such
that $S_U = S$.

Since in Algorithm 6 we exhaustively search over all multisets $U$ (of points from $V$) of size $k(\gamma)$, the
list $\mathcal{L}$ we output clearly contains all the $(\theta, \alpha, \beta)$ (weighted) self-determined communities. Moreover, clearly,
$n^{O(\log n/\gamma^2)}$ is an upper bound on the number of $(\theta, \alpha, \beta)$ (weighted) self-determined communities. ■

Algorithm 6 Algorithm for enumerating self-determined communities

Input: Affinity system $(V, \Pi)$, parameters $\theta$, $\alpha$, $\beta$, $\gamma$; $k(\gamma)$;

• Set $\mathcal{L} = \emptyset$.

• Exhaustively search over all multisets $U$ with elements from $V$ of size $k(\gamma)$.

• For $t = 1$ to $n$ (determining the meaning of "vote for") do:

• Let $S_U$ be the subset of points in $V$ that receive at least $(\alpha - \gamma/2)|U|$ amount of vote from
points in $U$. Add $S_U$ to the list $\mathcal{L}$.

• Remove from the list $\mathcal{L}$ all the sets that are not $(\theta, \alpha, \beta)$ weighted self-determined communities.

Output: List of self-determined communities $\mathcal{L}$.
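
On a toy instance, the exhaustive search of Algorithm 6 can be sketched as below. This is an illustration with several simplifying assumptions: it handles only the unweighted, ranking-based case; every ranking lists all elements, including the element itself; and the multiset size `k` is passed in as a small number, since $2\log(4n)/\gamma^2$ is far too large to enumerate in a demonstration. The function names are hypothetical.

```python
from itertools import combinations_with_replacement

def is_community(S, rankings, alpha, beta, theta):
    """Members of S vote for their top theta*|S| choices; check both conditions."""
    size = len(S)
    budget = int(theta * size)
    counts = {x: 0 for x in rankings}
    for s in S:
        for x in rankings[s][:budget]:
            counts[x] += 1
    return (all(counts[i] >= alpha * size for i in S)
            and all(counts[j] < beta * size for j in rankings if j not in S))

def enumerate_communities(rankings, alpha, beta, theta, k):
    """Try every multiset U of size k and every target size t, then filter."""
    gamma = alpha - beta
    n = len(rankings)
    candidates = set()
    for U in combinations_with_replacement(sorted(rankings), k):
        for t in range(1, n + 1):          # t fixes the meaning of "vote for"
            budget = int(theta * t)
            counts = {x: 0 for x in rankings}
            for u in U:
                for x in rankings[u][:budget]:
                    counts[x] += 1
            S_U = frozenset(x for x in rankings
                            if counts[x] >= (alpha - gamma / 2) * k)
            if S_U:
                candidates.add(S_U)
    return {S for S in candidates if is_community(S, rankings, alpha, beta, theta)}

rankings = {
    0: [0, 1, 2, 3, 4, 5], 1: [1, 2, 0, 4, 5, 3], 2: [2, 0, 1, 5, 3, 4],
    3: [3, 4, 5, 0, 1, 2], 4: [4, 5, 3, 1, 2, 0], 5: [5, 3, 4, 2, 0, 1],
}
found = enumerate_communities(rankings, alpha=1.0, beta=0.5, theta=1, k=2)
print(frozenset({0, 1, 2}) in found, frozenset({3, 4, 5}) in found)
```

The final filtering step matters: many candidate sets $S_U$ are produced that are not communities, and only the verification against the definition removes them.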



A.2 Additional Proofs in Section 3

THEOREM 2 For any constant $\theta \geq 1$ and any $\alpha \geq 2\sqrt{\theta}/n^{1/4}$, there exists an instance such that the number
of $(\theta, \alpha, \beta)$-self-determined communities with $\alpha - \beta = \gamma = \alpha/2$ is $n^{\Omega(1/\alpha)}$.

Proof: Consider $L = \sqrt{n}$ blobs $B_1, \ldots, B_L$, each of size $\sqrt{n}$. Assume that each point ranks the points inside
its blob first (in an arbitrary order) and then ranks the points outside its blob randomly. We claim that with
non-zero probability, for $l \leq n^{1/4}/(2\sqrt{\theta})$, any union of $l$ blobs satisfies the $(\theta, \alpha, \beta)$-self-stability property
with parameters $\alpha = 1/l$ and $\gamma = \alpha/2$.

Let us fix a set $S$ which is a union of $l$ blobs. Note that for each point $i$ in $S$, the expected number of points
in $S$ voting for $i$ is

$\sqrt{n} + (l\sqrt{n} - \sqrt{n}) \cdot \frac{\theta l \sqrt{n} - \sqrt{n}}{n - \sqrt{n}}$.

Also, for a point $j$ not in $S$ the expected number of points in $S$ voting for $j$ is

$l\sqrt{n} \cdot \frac{\theta l \sqrt{n} - \sqrt{n}}{n - \sqrt{n}} \leq l\sqrt{n} \cdot \frac{\theta l \sqrt{n}}{n} \leq \frac{\sqrt{n}}{4}$,

for $l \leq n^{1/4}/(2\sqrt{\theta})$. By Chernoff, we have that the probability that $j$ is voted for by more than $\sqrt{n}/2$ points is at most
$e^{-\sqrt{n}/48}$.

By a union bound over the at most $n$ points $j$ and the $\binom{\sqrt{n}}{l}$ unions of $l$ blobs, we get that the probability that
there exists a set $S$ which is a union of $l$ blobs that does not
satisfy the $(\theta, \alpha, \beta)$-self-stability property with $\alpha = 1/l$, $\gamma = \alpha/2$ is at most

$n \binom{\sqrt{n}}{l} e^{-\sqrt{n}/48}$,

which is less than 1 for $n$ sufficiently large and $l \leq n^{1/4}/(2\sqrt{\theta})$. ■

Corollary 1 The number of $(\theta, \alpha, \beta)$-self-determined communities in an affinity system $(V, \Pi)$ satisfies

$B(n) = n^{O(\log(1/\gamma)/\alpha)} \left(\frac{\theta \log(1/\gamma)}{\alpha}\right)^{O\left(\frac{1}{\gamma^2} \log\left(\frac{\theta \log(1/\gamma)}{\alpha\gamma}\right)\right)}$

and with probability $\geq 1 - 1/n$ we can find all of them in time $B(n)\,\mathrm{poly}(n)$.

Proof: Consider a community size $t$. For any $(\theta, \alpha, \beta)$-self-determined community $S$, let $p_S$ be the
probability that $S$ is in the list output by the algorithm in Theorem 1 with parameters $\theta$, $\alpha$, $\beta$, $t$. By Theorem 1
we have that $p_S \geq 1 - \delta$. By linearity of expectation, $\sum_S p_S$ is the expected number of
$(\theta, \alpha, \beta)$-self-determined communities in the list output by our algorithm. Combining these, we obtain that

$B(n)(1 - \delta) \leq \sum_S p_S \leq N_1(\delta) N_2(\delta)$, where $k_1 = \log(16/\gamma)/\alpha$, $k_2 = O\left(\frac{1}{\gamma^2} \log\left(\frac{\theta \log(1/\gamma)}{\alpha\gamma}\right)\right)$, $N_1(\delta) = n^{k_1}$,
and $N_2(\delta) = O\left(\left(\frac{\theta \log(1/\gamma)}{\alpha}\right)^{k_2} \log(1/\delta)\right)$. By setting $\delta = 1/2$, we get the desired bound,

$B(n) = n^{O(\log(1/\gamma)/\alpha)} \left(\frac{\theta \log(1/\gamma)}{\alpha}\right)^{O\left(\frac{1}{\gamma^2} \log\left(\frac{\theta \log(1/\gamma)}{\alpha\gamma}\right)\right)}$.

Let $N = N_1(1/2) N_2(1/2)\, n$. By running the algorithm in Theorem 1 $2\log(N)$ times we have that for each
$(\theta, \alpha, \beta)$-self-determined community $S$, the probability that $S$ is not output in any of the runs is at most
$(1/2)^{2\log(N)} \leq 1/N^2$. By a union bound, with probability at least $1 - 1/n$, we output all of them. ■

A.3 Self-determined Communities in Multi-faceted Affinity Systems 

Recall that a multi-faceted affinity system is a system where each node may have more than one ranking
of other nodes. This may reflect, for example, that a person may have two rankings of other people, one
corresponding to personal friends (in descending order of affinity), and one of co-workers. Suppose that each
element $i$ is allowed to have at most $f$ different rankings $\pi_i^1, \ldots, \pi_i^f$. We say that the pair $(S, \psi)$ is a
multi-faceted community, where $\psi : S \to \{1, \ldots, f\}$, if $S$ is a community where $\psi(i)$ specifies which ranking
facet should be used by element $i$. In other words, as before, let $\phi^{\theta}_{S,\psi}(i) := |\{s \in S \mid i \in \pi_s^{\psi(s)}(1 : \lceil \theta|S| \rceil)\}|$.
Then $(S, \psi)$ is an $(\alpha, \beta, \theta)$-multifaceted community if for all $i \in S$, $\phi^{\theta}_{S,\psi}(i) \geq \alpha|S|$, and for all $j \notin S$,
$\phi^{\theta}_{S,\psi}(j) < \beta|S|$.

For a bounded $f$, it is not harder to find multifaceted communities than to find regular communities. Note
that all our sampling algorithms can be adapted as follows. Once a representative sample $\{i_1, \ldots, i_k\}$
of the community $S$ is obtained, we can guess the facets $\psi(i_1), \ldots, \psi(i_k)$ while adding a multiplicative $f^k$
factor to the running time. We can thus get the set $S_2$ approximating $S$ in the same way as it is found in
Algorithms 2 and 3 while adding a multiplicative factor of $f^{k_1 + k_2}$ to the running time. We thus obtain a list
$\mathcal{L}$ that for each multi-faceted community $(S, \psi)$ contains a set $S_2$ such that $\Delta(S_2, S) < \gamma t/8$:

Claim 1 We can output a list $\mathcal{L}$ of $(f \cdot n)^{O(\log(1/\gamma)/\alpha)} \left(\frac{f \cdot \theta \log(1/\gamma)}{\alpha}\right)^{O\left(\frac{1}{\gamma^2} \log\left(\frac{\theta \log(1/\gamma)}{\alpha\gamma}\right)\right)}$ sets, such that for
each multi-faceted community $S$ there is an $S_2 \in \mathcal{L}$ such that $\Delta(S_2, S) < \gamma t/8$.

It remains to show that: 

Lemma 4 Suppose that $(S, \psi)$ is a valid $(\alpha, \beta, \theta)$-multifaceted community of size $t$. Given $t$ and a set $S_2$
such that $\Delta(S_2, S) < \gamma t/8$, there is an algorithm that outputs $S$ with probability $\geq f^{-8\log n/\gamma^2}$.

Moreover, a facet structure $\psi'$ can be recovered on $S$ so that $(S, \psi')$ is an $(\alpha - \gamma/4, \beta + \gamma/4, \theta)$-multifaceted
community.

Together with Claim 1, Lemma 4 shows that multifaceted communities can indeed be recovered in polynomial
time.

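Theorem 5 Let $S$ be an $f$-faceted $(\alpha, \beta, \theta)$-community. Then there is an algorithm that runs in $O(n^2)$ time
and outputs $S$, as well as a facet structure $\psi'$ on $S$ such that $(S, \psi')$ is an $(\alpha - \gamma/4, \beta + \gamma/4, \theta)$-multifaceted
community with probability at least

$(f \cdot n)^{-O(\log(1/\gamma)/\alpha)} \left(\frac{f \cdot \theta \log(1/\gamma)}{\alpha}\right)^{-O\left(\frac{1}{\gamma^2} \log\left(\frac{\theta \log(1/\gamma)}{\alpha\gamma}\right)\right)} f^{-O(\log n/\gamma^2)}$.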

Proof (of Lemma 4): The algorithm is very simple. Guess a set $U_2$ of $m = 8\log n/\gamma^2$ points in $S_2$; guess a
function $\psi_2$ on $U_2$; output $S$ = the set of points that receive at least $(\alpha - \gamma/2)t$ votes according to $(U_2, \psi_2)$.

Note that in the non-faceted case, by Hoeffding's inequality, with probability $> 1/2$ selecting a set $U_2$ as
above and then selecting those points that receive at least $(\alpha - \gamma/2)t$ votes from $U_2$ would have yielded $S$.
This is because each element of $S$ receives at least $(\alpha - \gamma/8)t$ votes from elements of $S_2$, while each element
of the complement $S^c$ receives at most $(\beta + \gamma/8)t$ votes from elements of $S_2$. This reasoning extends to
the multifaceted setting, provided the function $\psi_2$ coincides with the function $\psi$ on the elements of $U_2 \cap S$.
This indeed happens with probability $\geq f^{-|U_2|} = f^{-8\log n/\gamma^2}$, completing the proof of the first part of the
lemma.

For the second part of the lemma we assume that the set S is known and we need to recover the facets ip' 
that make S a community. Note that this step is necessary in order to verify that S is indeed a multifaceted 
community. There are two cases to consider. 

Case 1: $t < 8\log n/\gamma^2$. In this case we can find $\psi$ by exhaustively checking all possibilities in time
$O(q^{8\log n/\gamma^2})$, which matches the reciprocal of the success probability of the first step.
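
The exhaustive search of Case 1 can be sketched as follows. This is a hedged illustration, not the authors' code: the data layout (`prefs[s][i]` as the preference list of `s` in facet `i`), the convention that each member votes for its top `round(theta * t)` entries, and the function name are all assumptions made for the example.

```python
from itertools import product

def find_facets_exhaustively(S, prefs, num_facets, alpha, beta, theta, universe):
    """Try every facet assignment on S; return the first one under which
    members get >= alpha*t votes and non-members get <= beta*t votes."""
    members = sorted(S)
    t = len(members)
    k = max(1, round(theta * t))  # assumed number of votes cast per member
    for assignment in product(range(num_facets), repeat=t):
        counts = {}
        for s, i in zip(members, assignment):
            for x in prefs[s][i][:k]:
                counts[x] = counts.get(x, 0) + 1
        inside_ok = all(counts.get(x, 0) >= alpha * t for x in S)
        outside_ok = all(counts.get(y, 0) <= beta * t
                         for y in universe if y not in S)
        if inside_ok and outside_ok:
            return dict(zip(members, assignment))
    return None  # no assignment certifies S as a community
```

The loop visits `num_facets ** t` assignments, which is the source of the exponential dependence on $t$ and the reason this route is used only when $t$ is small.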

Case 2: $t > 8\log n/\gamma^2$. In this case we use linear programming to find a fractional version $\psi_f$ of the
function $\psi$ first. In other words, we find a function $\psi_f : S \times \{1, \ldots, q\} \to [0, 1]$ such that $(S, \psi_f)$ is a
"community" on average:
1. for all $s \in S$, $\sum_{i=1}^{q} \psi_f(s, i) = 1$;

2. for all $x \in S$, $\sum_{s \in S} \sum_{i=1}^{q} \psi_f(s, i) \cdot \chi_{x \in \pi^i_s(1:\theta t)} \ge \alpha t$;

3. for all $y \notin S$, $\sum_{s \in S} \sum_{i=1}^{q} \psi_f(s, i) \cdot \chi_{y \in \pi^i_s(1:\theta t)} \le \beta t$.

This linear program is feasible, since the original $\psi$ is an integral solution to it. As a result, we obtain a
fractional solution $\psi_f$ satisfying the three conditions. To obtain $\psi'$ we round $\psi_f$ by sampling. In other
words, we set $\psi'(s) = i$ with probability $\psi_f(s, i)$. By Hoeffding's inequality, since $t > 8\log n/\gamma^2$, the
sampling will preserve conditions 2 and 3 that were imposed on $\psi_f$ up to an additive error of $\gamma t/4$. Thus, by
definition, $(S, \psi')$ will be an $(\alpha - \gamma/4, \beta + \gamma/4, \theta)$-multifaceted community. ■
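
The rounding step of Case 2 can be illustrated with a short Python sketch. It assumes a fractional solution is already available (solving the linear program is not shown) in the hypothetical form `psi_frac[s]`, a list of nonnegative weights over facets summing to 1. It also assumes that `prefs[s][i]` is the preference list of `s` in facet `i`, and that each member votes for its top `round(theta * t)` entries. All names are illustrative.

```python
import random

def round_facets(psi_frac, rng):
    """Independently set psi'(s) = i with probability psi_frac[s][i]."""
    return {s: rng.choices(range(len(w)), weights=w, k=1)[0]
            for s, w in psi_frac.items()}

def is_community(S, psi, prefs, alpha, beta, theta, universe):
    """Check the two vote conditions for the integral facet structure psi."""
    t = len(S)
    k = max(1, round(theta * t))  # assumed number of votes cast per member
    counts = {}
    for s in S:
        for x in prefs[s][psi[s]][:k]:
            counts[x] = counts.get(x, 0) + 1
    inside_ok = all(counts.get(x, 0) >= alpha * t for x in S)
    outside_ok = all(counts.get(y, 0) <= beta * t
                     for y in universe if y not in S)
    return inside_ok and outside_ok
```

After rounding, one would call `is_community` with the relaxed parameters $\alpha - \gamma/4$ and $\beta + \gamma/4$ and resample if the check fails.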

A.4 Extensions to weighted affinity systems and to the local model 

We note that Algorithm 4 can be combined with our reduction from weighted to unweighted communities
to obtain a local algorithm for finding communities in the weighted case.

Extending the local approach to the multi-faceted setting is more involved, since the definition of $R(v)$
would need to be adapted to this setting. Indeed, the multi-faceted version $R_f(v)$ of $R(v)$ can be taken to be
a random element voted by a random facet $i$ of $v$. Then Algorithm 4 can be adapted by taking the threshold
to be $(\alpha - \gamma/2)/(\theta f)$, where $f$ is the number of facets. Note that while an approximation to any community
$S$ can be found locally in near-linear time, finding the exact community $S$ as well as the facet structure on
$S$ as in Lemma 4 will still take $f^{O(\log n/\gamma^2)}$ time.
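
One possible reading of $R_f(v)$ and of the adapted threshold is sketched below. The sketch rests on assumptions not stated in the text: the facet is drawn uniformly, the voted element is drawn uniformly from the top `k` entries of that facet's preference list `prefs[v][i]`, and both function names are hypothetical.

```python
import random

def random_voted_element(v, prefs, k, rng):
    """R_f(v): pick a uniformly random facet of v, then a uniformly
    random element among the top k entries of that facet's list."""
    facet = rng.randrange(len(prefs[v]))
    return rng.choice(prefs[v][facet][:k])

def adapted_threshold(alpha, gamma, theta, num_facets):
    """Threshold (alpha - gamma/2) / (theta * f) for the adapted local algorithm."""
    return (alpha - gamma / 2) / (theta * num_facets)
```

The extra factor of $f$ in the denominator reflects that a random facet matches the relevant one with probability only $1/f$ under the uniform choice assumed here.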